CDNs: Down but not out

August 31, 2021 12:09 PM

Following the well-publicised CDN outages of 2021, Capacity’s Natalie Bannerman explores how we can future-proof this infrastructure to avoid such events happening again

In the age of content streaming, we are all likely to have some kind of video streaming service such as Disney+, Prime Video or Netflix in our homes and – depending on our interests – some kind of gaming service in addition.

But how many of us actually think about the networks that deliver this content to us? The global video streaming market was valued at $50.11 billion in 2020 and is expected to increase at a compound annual growth rate of 21% from 2021 to 2028.

With such a large market, it is important that content delivery networks, sometimes referred to as content distribution networks (CDNs), are not only optimised and scalable, but secure.

As a series of geographically distributed network proxy servers and their data centres, CDNs are just as susceptible to outages and security threats as any other form of infrastructure.

Network blackouts

A case in point occurred earlier this year, when Fastly, the cloud computing services and CDN provider, caused a widespread digital blackout as a result of a software bug that brought down its edge cloud platform. The hour-long outage affected the likes of Reddit, gov.uk, Twitch, Spotify, Amazon, the New York Times, The Guardian, CNN and the BBC.

At the time, Fastly’s SVP of engineering and infrastructure, Nick Rockwell, confirmed that the incident was triggered “by a valid customer configuration change”, with the underlying bug having been introduced in a software deployment on 12 May.

Following this, on 8 June, a customer pushed a valid configuration change, which included “the specific circumstances that triggered the bug, which caused 85% of our network to return errors”, said Rockwell.

Though the Fastly incident was short-lived, a week or so later a similar disruption occurred in Australia, when web services company Akamai also faced a problem with its CDNs – this time taking down airlines and banks including ANZ, Westpac, St George, ME bank, Macquarie Bank, American Airlines, Southwest Airlines, United Airlines and Delta Air Lines.

At the time, Akamai confirmed that it had “experienced an outage for one of its Prolexic DDoS services (Routed 3.0)”, affecting roughly 500 of its customers. The company was keen to stress that: “The issue was not caused by a system update or a cyberattack. A routing table value used by this particular service was inadvertently exceeded. The effect was an unanticipated disruption of service.”

As a result of these disruptions, a wider conversation around CDN security and resiliency was sparked, starting with the most prevalent question: will these types of outages become the new normal?

The new normal

According to Ranjan Goel, vice president of product management at LogicMonitor, a cloud-based infrastructure monitoring and observability platform, as our infrastructure becomes more complicated, the chances of such blackouts also increase.

“As our IT infrastructures grow in complexity, sweeping outages are likely to increase in frequency and severity unless infrastructure monitoring capabilities keep pace with the rate of complexity,” he says.

“The only way to prevent issues resulting from IT infrastructure – such as CDNs – that may lead to widespread outages is through holistic visibility into the entire IT infrastructure to identify problems before they result in wider damage.”

His answer to this problem is leveraging automation and AI – because, as he puts it: “This is a task that humans cannot deal with alone: AIOps and machine learning software must be put to the task.

“AI solutions can pore over the billions of data points IT environments produce and quickly alert IT teams when there is an issue that needs looking at before it results in an outage event,” he explains.

“This may not prevent every outage, as there is no single silver bullet, but it will greatly mitigate the issue and shorten outage mitigation times if and when they do occur.”

Interestingly, Andy Still, CTO of Netacea, a provider of AI-enabled bot detection and mitigation, takes a slightly different view, believing that “outages like these are very rare”.

“Platform resilience is generally getting better, and outages are much less common than they used to be. This is driven by improvements in technology and automation of high-availability systems – systems that are designed to be highly available so will automatically failover to replacements in the event of any issues,” he says.

If we resign ourselves to the fact that more infrastructure blackouts are likely to occur, the question then becomes how to minimise the effects of such blackouts – and perhaps increased competition in the space is the answer. Since the Fastly outage, many industry experts have said that overreliance on a small number of cloud and CDN providers means that when services go down, the outage is much bigger in scale and impact.

“Companies should use a multi-CDN infrastructure from multiple vendors to minimise, or even avoid, the impact of catastrophic outages like the Fastly and Akamai events,” said Kris Beevers, CEO at NS1, an application traffic intelligence company.

“This also helps them avoid lock-in and gain leverage to keep CDN costs in check. But they must have high observability of their global application delivery performance as well as the ability to immediately take action if a CDN fails to perform as expected.”
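To make the idea concrete, the following is a minimal sketch of the kind of failover decision Beevers describes, assuming a company already collects an error rate and a latency figure for each of its CDNs. The names used here (`CdnHealth`, `choose_cdn`, the CDN labels) and the thresholds are hypothetical and for illustration only; they do not come from NS1 or from any vendor’s product.

```python
from dataclasses import dataclass


@dataclass
class CdnHealth:
    """Recent health observations for one CDN (all fields are hypothetical)."""
    name: str
    error_rate: float      # fraction of requests returning errors, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile response time in milliseconds


def choose_cdn(candidates, max_error_rate=0.05, max_latency_ms=800.0):
    """Return the name of the first CDN, in priority order, that looks healthy.

    Returns None when every candidate breaches a threshold, so the caller
    can fall back to serving from origin or raise an alert.
    """
    for cdn in candidates:
        if cdn.error_rate <= max_error_rate and cdn.p95_latency_ms <= max_latency_ms:
            return cdn.name
    return None


# Illustrative numbers only: the primary is returning errors for most requests.
observed = [
    CdnHealth("cdn-primary", error_rate=0.85, p95_latency_ms=120.0),
    CdnHealth("cdn-secondary", error_rate=0.01, p95_latency_ms=140.0),
]
print(choose_cdn(observed))
```

In practice the decision would usually be applied through DNS or a traffic steering layer rather than in application code, and the thresholds would need tuning against real traffic.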

This sentiment is echoed by Magnus Bjornsson, CEO of Men&Mice, a provider of sustainable network management, who reminds us that as more businesses rely on a small circle of CDN providers, when problems do occur “the fallout is likely to be sweeping”.

“The only effective way of resolving this issue is to add redundancy,” he adds. “It is therefore critical that CDNs have a proper redundancy solution in place, but also for the users themselves to think about redundancy and build their product to not use a single CDN service.”

As important as redundancy is, Still says that the size of the larger CDNs is directly tied to how effectively they deliver content, and that relying on smaller providers could in fact lead to more outages overall.

“One of the key benefits of a large CDN is its size. In fact, size is one of the drivers for using a CDN. The more companies using this service, the better the underlying network can be – a large number of small CDNs would lose this benefit,” explains Still.

“The size of a bigger CDN means the impact appears larger, but that is simply because there are a lot of simultaneous outages. Smaller companies would likely have more outages, but they would be more distributed, so not as noticeable.”

Security

The most important part of the solution is, of course, security. But first, how are CDNs currently being fortified? Well, as with most other cloud-based infrastructure, this includes everything from DDoS mitigation, SSL certification and application firewalls to monitoring and visibility platforms.

“Leveraging multiple managed DNS services to ensure optimal DNS redundancy is quickly becoming the clear best practice for CDNs which are being used by businesses,” says Bjornsson.

“That means automatically taking care of the replication and synchronisation of data in a reliable and consistent way.”

On the topic of building high availability into the key layers in the infrastructure stack, Paul Speciale, chief product officer at Scality, says that software-based virtualisation and software-defined storage and networking have become “commonplace in the data centre, and they leverage commodity hardware” meaning that “high availability, security and manageability really need to be planned at all the key layers in the infrastructure stack”.

The key attributes required to achieve “gold standard” availability – or what is commonly referred to as 99.999% availability – Speciale says, include solutions built on distributed systems, with redundancy in both software and hardware components to eliminate single points of failure.
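For a sense of scale, the 99.999% figure can be converted into permitted downtime with some simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, a 365-day year is assumed, and the redundancy calculation assumes components fail independently of one another, which real systems do not guarantee.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability):
    """Maximum yearly downtime allowed by an availability target (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(component_availability, copies):
    """Availability of `copies` redundant components, assuming independent failures."""
    return 1.0 - (1.0 - component_availability) ** copies


# Compare three common availability targets.
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_minutes_per_year(target):.2f} minutes per year")

# Two redundant components, each available 99.9% of the time.
print(f"{redundant_availability(0.999, 2):.6f}")
```

On these assumptions, 99.999% availability leaves a little over five minutes of downtime a year, and two independent 99.9% components in parallel come out ahead of that target. This is why redundancy features so heavily in the designs Speciale describes.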

“These systems should be designed to fully anticipate and expect failure events to occur: components fail, services fail – so modern system design is to expect failures to happen and have a design that can route around the failures through alternative paths,” adds Speciale.

Aside from the use of AI and automation to detect and correct problems, and cost-effective networking, he also says that “self-healing systems have become more common”, meaning systems that can recover automatically from events such as server or disk drive failures by rebuilding data and storing it redundantly on other servers or disk drives to restore protection levels.
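The self-healing behaviour described here can be sketched in a few lines. The example below is a simplified illustration, not Scality’s implementation: the `heal` function, the server names and the replica count are all hypothetical, and it only tracks where copies should live rather than moving any data.

```python
def heal(placement, live_servers, replicas_wanted):
    """Restore each object's replica count after server failures.

    placement maps an object id to the set of servers holding a copy.
    Returns a new mapping; the input is not modified.
    """
    healed = {}
    for obj, holders in placement.items():
        surviving = holders & live_servers
        # Candidate servers that are alive and do not already hold a copy,
        # sorted so the result is deterministic.
        spares = sorted(live_servers - surviving)
        missing = max(0, replicas_wanted - len(surviving))
        healed[obj] = surviving | set(spares[:missing])
    return healed


# Illustrative placement of two objects across servers s1 to s4.
placement = {"video-1": {"s1", "s2", "s3"}, "video-2": {"s2", "s3", "s4"}}
live = {"s1", "s3", "s4", "s5"}  # s2 has failed
print(heal(placement, live, replicas_wanted=3))
```

A real system would also have to copy or rebuild the data itself, and an object with no surviving copies cannot be recovered this way, which is why the protection level needs to be restored promptly after each failure.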

Overall, Still says that “any good security approach will consider security of both the infrastructure and application”, pointing out that often security attacks occur via the business logic of an application.

“So, rather than exploiting a technical weakness, they undertake legitimate activity for illegitimate aims – for example, creating thousands of fake accounts to get a free bonus with each account.”

“Streaming content continues to surge, not just from video and gaming but also from the high-definition delivery networks by non-media companies,” says Goel.

This in turn means that these companies are recognising that CDNs are now part of enterprise architecture that “needs to be actively monitored for the overall availability of their business services.”

As such, “it’s important that CDNs are designed with redundancy in place to ensure that they can continue delivering content to users without failure,” adds Bjornsson.