Fastly confirms software bug caused "broad and severe" outage

June 09, 2021 08:26 AM

Fastly has confirmed that a software bug was behind the outage that brought down its edge cloud platform yesterday.

The incident has nonetheless prompted calls for more infrastructure to prevent a repeat.

Lasting approximately one hour, the outage affected Amazon, Reddit, CNN, and even the UK government's website. Ivo Ivanov, CEO of DE-CIX International, said the event highlighted "the obvious need for further infrastructure".

"Unfortunately there are outages like this because infrastructure is infrastructure and physics is physics," Ivanov told Capacity.

"I don’t want to comment on single organisations because we are probably not aware of the full details. What I will comment on is the obvious need for further infrastructure, localising infrastructure, building bigger pipes, and creating more and more internet exchanges.

"This is what DE-CIX has committed to running in different regions, making sure that the distribution of traffic can be more and more localised; avoiding congestion on the big global highways, hand in hand with the localisation of applications and content. The more infrastructure can be localised, the better it is for the entire industry, so in terms of regional outages traffic can be easily rerouted and local effects will be minimised to zero," he continued.

Angelique Medina, director of product marketing at Cisco ThousandEyes, highlighted how concentrations in connectivity magnified – and multiplied – the problem.

She said: “By caching web content close to users for maximum performance and availability, CDNs are critical to how we access digital services. Together with major public cloud providers who host or provide services to the most-used sites and applications online, the delivery mechanism that is the internet is largely powered by a few providers and whenever something goes wrong with one of them, it can have a massive impact on web users globally.”

What happened?

In a blog post published later on 8 June US time, Fastly's SVP of engineering and infrastructure, Nick Rockwell, confirmed the reason for the outage: a bug introduced in a software deployment on 12 May was triggered "by a valid customer configuration change".

On 8 June, a customer pushed a valid configuration change that included "the specific circumstances that triggered the bug, which caused 85% of our network to return errors".

"We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration," Rockwell wrote. "Within 49 minutes, 95% of our network was operating as normal. This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them," his post continued.

"A complete postmortem" will be conducted on the processes and practices that triggered the chain of events.

"We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes. We’ll evaluate ways to improve our remediation time," Rockwell pledged.

For its part, Fastly appears to have no issues with resilience. Although its share price initially dropped in reaction to the news, it climbed overall and, at the time of writing, was trading at US$56.20 a share, compared to US$42.31 on 6 May.