The internet will fail, but you determine its impact
06 September 2021 | Neil Miller
Online, failure is a fact of life – and service provision. However, as Neil Miller, director, solutions engineering, Cisco ThousandEyes, writes, the severity of outages can be mitigated
On 8 June, an “Error 503 service unavailable” message appeared on screens around the world. In today’s internet-connected world, we are no strangers to the occasional outage, but this particular blackout was far-reaching and took out some of the world’s most visited sites and applications. Within 45 minutes, Fastly, a global Content Delivery Network (CDN) provider, issued a statement saying it had identified an issue. The outage lasted for about an hour. Later, Fastly identified a latent bug introduced in a software update as the cause.
Before the internet had stopped talking about Fastly’s outage, another CDN provider joined the dreaded downtime list less than two weeks later: Akamai, whose DDoS mitigation service, Prolexic Routed, experienced a service disruption. While not as long or as disruptive as Fastly’s, it served as a clear reminder that the internet can tumble.
Twice in a month, companies that most people have never heard of showed how interdependent our global internet infrastructure is. Yet neither outage affected all businesses in the same way. What became clear is that customers’ IT teams with a back-up plan, informed by an understanding of the building blocks that make up the internet, were able to act and recover more quickly. So, what really happened when the global internet glitched twice in two weeks? And what lessons can businesses and network providers learn?
Understanding CDNs’ critical role in modern web delivery
Still an elusive black box to even the most seasoned network professionals, the internet is a complex ecosystem of providers and interdependencies. As such, outages are inevitable. Even the most sophisticated companies experience outage events that can have significant impacts on their customers and global internet users.
To build, operate, and troubleshoot applications and services successfully over the internet, you first need to understand the web’s underlying protocols and services – and this includes CDN providers.
A single web page may be composed of dozens or even hundreds of web objects – some configured to be stored by a CDN’s caching servers, and others configured to be refreshed from the origin frequently or even with each user request.
Many popular services are composed of dozens or hundreds of different web objects which leverage different CDN providers to deliver content to users, primarily for redundancy but also for optimising performance. For example, user requests could be load balanced across multiple CDNs using DNS query responses. Alternatively, the root object for a site could point to an index.html file served by one particular CDN provider, but subsequent site components could be served by different CDN(s) or other sources.
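As a rough illustration of the DNS-based load balancing described above, the sketch below picks one CDN per query in proportion to a configured weight. The hostnames, weights, and function names are hypothetical, and real DNS traffic-management services are considerably more involved; this only shows the selection idea.

```python
import random

# Hypothetical CDN hostnames and traffic weights (assumptions for illustration).
CDN_WEIGHTS = {
    "cdn-a.example.net": 60,
    "cdn-b.example.net": 30,
    "cdn-c.example.net": 10,
}


def pick_cdn(weights, rng=random):
    """Return one CDN hostname, chosen in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def answer_query(site, weights, rng=random):
    """Model a DNS answer as a (name, record type, target) tuple."""
    return (site, "CNAME", pick_cdn(weights, rng))


if __name__ == "__main__":
    # Each simulated query may be steered to a different CDN.
    for _ in range(5):
        print(answer_query("www.example.com", CDN_WEIGHTS))
```

Because the choice is made at query time, shifting traffic between providers is a matter of changing the weights rather than redeploying the site.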
Ultimately, how a site or application owner chooses to architect its content delivery can determine the severity of impact of an outage like the one Fastly experienced.
Inside Fastly’s outage
When the Fastly outage began, there was a dramatic, global drop in the availability of its service – but not all the content it delivered went offline, and not all customers were equally affected. Some of Fastly’s customers were able to minimise the impact to their services by leveraging alternative providers to deliver content.
ThousandEyes’ analysis of the outage shows that differing delivery architectures and mitigation plans led to different outcomes for Fastly customers’ users. For example, some customers were using Fastly as the sole CDN for their primary site domains, yet even they were not all affected in the same way: some were affected throughout the outage, while others were not.
One company that used only Fastly was still able to remediate before Fastly fixed its service. The network path shows that, about 40 minutes after its site went offline, the site’s operators made a manual update that rerouted traffic away from Fastly servers to Google Cloud Platform (GCP) servers, lessening the impact of the outage.
At the same time, there was also evidence of another customer using not one but three CDNs to deliver its site. Not only did it use its own CDN service, but it also leveraged a diverse set of other providers to continuously load-balance traffic across each and deliver the best possible experience to visitors. When the Fastly outage hit, the customer began removing Fastly from its DNS responses to move traffic away from the affected servers.
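The mitigation described here, dropping an affected CDN from DNS responses, can be sketched as a filter over health-check results. This is a minimal sketch under stated assumptions: the hostnames, the shape of the health-check data, and the fallback to an origin server are all hypothetical and do not describe any particular customer’s setup.

```python
# Hypothetical origin hostname used only when no CDN is healthy (assumption).
ORIGIN = "origin.example.com"


def dns_targets(cdns, health, origin=ORIGIN):
    """Return the CDNs that may appear in DNS responses.

    `health` maps a CDN hostname to True if its latest check passed.
    A CDN with no recorded check is treated as unhealthy.
    """
    healthy = [cdn for cdn in cdns if health.get(cdn, False)]
    # With no healthy CDN left, fall back to serving from the origin.
    return healthy if healthy else [origin]


if __name__ == "__main__":
    cdns = ["cdn-a.example.net", "cdn-b.example.net", "cdn-c.example.net"]
    # Simulate one provider failing its health check.
    health = {
        "cdn-a.example.net": False,
        "cdn-b.example.net": True,
        "cdn-c.example.net": True,
    }
    print(dns_targets(cdns, health))
```

How quickly users actually move away from the failed provider also depends on how long resolvers cache the earlier DNS answers, which this sketch does not model.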
While the Fastly outage was broad and significant, not every site utilising its services experienced severe effects. What became obvious was that customers using multiple CDNs were only partially affected and eventually were able to fall back on alternative providers. Businesses using Fastly as their sole CDN were taken offline completely, and although we know some customers were able to redirect users to their origin servers, the manual process resulted in further delay in getting their services back online.
A reminder on redundancy
If the Fastly outage has taught us anything, it is that an enterprise must understand, and have visibility into, every third-party dependency that can affect customers’ web and app experience – even indirect, “hidden” ones. Businesses should also consider diversifying their delivery services, for example by using two or more CDNs, to ensure optimal delivery and to reduce the impact of any one CDN experiencing a disruption in service.
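One simple starting point for that visibility is to list the hostnames a page’s objects are fetched from. The sketch below groups object URLs by hostname and flags those outside the site’s own domain. The URLs and function names are hypothetical, and a hostname alone does not reveal which provider sits behind it, so this is a first pass rather than a full dependency map.

```python
from collections import Counter
from urllib.parse import urlsplit


def objects_per_host(urls):
    """Count how many page objects are fetched from each hostname."""
    hosts = (urlsplit(url).hostname for url in urls)
    # Skip relative URLs, which have no hostname of their own.
    return Counter(host for host in hosts if host)


def third_party_hosts(urls, own_domain):
    """Return hostnames that are neither own_domain nor one of its subdomains."""
    return sorted(
        host
        for host in objects_per_host(urls)
        if host != own_domain and not host.endswith("." + own_domain)
    )


if __name__ == "__main__":
    # Hypothetical object URLs for a single page (assumption for illustration).
    page_objects = [
        "https://www.example.com/index.html",
        "https://static.cdn-a.example.net/app.js",
        "https://static.cdn-a.example.net/site.css",
        "https://fonts.cdn-b.example.org/font.woff2",
    ]
    print(objects_per_host(page_objects))
    print(third_party_hosts(page_objects, "example.com"))
```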
So, what about network providers? With end-to-end visibility, organisations can continuously evaluate the availability and performance of service delivery and therefore have informed conversations with service providers when things go wrong. As such, service providers need to ensure they work with businesses to respond to and resolve issues quickly, while optimising performance for the future.
As businesses continue to adjust to digital-first services, IT teams are expected to manage a complex ecosystem of internet-dependent services and providers that reside beyond their corporate perimeters.
The outage events of recent weeks are a clear reminder that this complicated web can break. They have also shown that organisations need a back-up plan and insight into the services they depend on, directly and indirectly. Part of this redundancy plan must involve working collaboratively with service providers, with clear lines of communication.
Together, all parties will be better prepared for our future hyper-connected world.