Outages 101: lessons from the web’s inner workings
Sponsored Content

Doug Madory

Doug Madory, director of internet analysis at Kentik, has a variety of tools that give him an eagle-eyed view of what’s happening across the web. Dubbed “the man who can see the internet” by The Washington Post, he talks to Capacity about what we can learn from internet outages and how this information can help us going forward. 

How do you see the current environment concerning internet outages worldwide in light of the rise of digitalisation in recent years?

I publish analysis on some of the internet’s biggest outages. On any given day, there are a lot of things that are broken and offline around the world, but I spend more time on the events that end up in the headlines because they have the most impact on people and businesses.

Earlier this year, there was a major Microsoft Azure outage, while 2021 had several major incidents – including a Facebook outage that October, which will probably go down as one of the great outages of the internet thus far. There was also a Rogers Communications one last summer, which was Canada’s largest-ever internet outage. If we take a step back and look at these, we can see a few different trends emerging.

For instance, there have been a lot of discussions in the aftermath of the Facebook and Rogers outages about the lack of effective ‘out-of-band’ networks – referring to infrastructure that’s completely disconnected from the main network and could be used to communicate in the event of an outage. However, a separate infrastructure is not easy to maintain and can introduce its own security concerns, so this is easier said than done.

What other trends do you see in this area?

Cloud computing is growing every year. That’s led to consolidation in terms of how many networks are pushing large amounts of traffic, with services like Amazon Web Services [AWS], Microsoft Azure and Google Cloud Platform becoming ever more popular.

This means when one of those platforms goes down, it takes many more things down with it. In the aftermath of the AWS outage in December 2021, there was a question of whether putting all our eggs into a few baskets is leading to vulnerability.

But it’s not obvious that we’re in a net worse place because a few large public clouds now serve a huge number of companies. You can’t do this in practice, but if you could somehow tally up all the outages that were avoided because infrastructure was outsourced to the cloud, I think we’d still come out ahead in terms of uptime.

Why is it important to analyse outages?

One reason is that folks in the industry follow these incidents even if the outage doesn’t affect them. They want insight into whether they might have a similar vulnerability in their own environment.

When a company that suffers an outage publishes a root cause analysis (RCA), people in our community will scan it, looking for takeaways they can apply to their own networks.

Such reports usually offer a few nuggets that help network engineers better understand and learn from the outage. However, they often lack detail because the company has to be mindful of what it divulges. It therefore helps to have independent, data-driven analysis like ours that verifies the assertions in those reports against empirical evidence.

What are some common causes of these outages?

After each of these incidents, other analysts and I get asked: was this a cyberattack? Almost always, the answer is no. The vast majority of these big outages are self-inflicted, caused by misconfigurations of one sort or another being pushed into the network.

An overarching trend that touches many of these analyses involves the risks incurred by centralised automation. To run these super-large networks that are now taking on the infrastructure of many thousands of businesses, cloud providers have no choice but to develop centralised automation software to administer their exceedingly complex environments.

But that means if there’s a bug in that software, it can cause tremendous damage. At that scale, there’s a lot of complexity to manage, even for a large team of experienced engineers. I think the first step is acknowledging this complexity as a risk and approaching it with humility. If companies with seemingly infinite resources like the major public clouds can get caught out, then so can other networks.

What does Kentik bring to the table to try and resolve such issues?

At Kentik, we offer software-as-a-service [SaaS] capabilities that provide our customers with a level of network observability that they couldn’t otherwise attain. One of the primary sources for tracking how traffic is moving through a network is NetFlow, a protocol developed by Cisco Systems.
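
To make that concrete, the sketch below shows the kind of per-source aggregation a flow-analysis tool performs once flow records have been decoded. It is a minimal Python illustration rather than Kentik’s implementation: the records, field names and addresses are hypothetical, not actual NetFlow fields.

```python
from collections import defaultdict

# Hypothetical, already-decoded flow records; real NetFlow/IPFIX records are
# exported in a binary format and carry many more fields.
flows = [
    {"src_ip": "203.0.113.10", "dst_ip": "198.51.100.5", "bytes": 1_200_000},
    {"src_ip": "203.0.113.10", "dst_ip": "198.51.100.9", "bytes": 450_000},
    {"src_ip": "192.0.2.7",    "dst_ip": "198.51.100.5", "bytes": 3_800_000},
]

# Group traffic volume by source address, one of many dimensions a flow
# analysis tool can pivot on.
bytes_by_source = defaultdict(int)
for flow in flows:
    bytes_by_source[flow["src_ip"]] += flow["bytes"]

for src, total in sorted(bytes_by_source.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src}: {total / 1e6:.1f} MB")
```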

We also offer ‘synthetics’ – capabilities that enable digital-experience monitoring and can provide alerts to performance problems. Furthermore, Kentik monitors the Border Gateway Protocol [BGP], a long-established routing protocol for the internet and a common source of connectivity problems. Lastly, we offer DDoS protection services that help detect and neutralise distributed denial-of-service attacks, which can be the cause of outages.
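
As a rough illustration of what a synthetic test does, the sketch below runs a single HTTP probe and raises an alert if the fetch is slow or fails. The endpoint and latency budget are hypothetical, and a real digital-experience monitoring platform would run such tests continuously from many vantage points.

```python
import time
import urllib.request

TARGET = "https://example.com/"   # hypothetical endpoint to monitor
LATENCY_BUDGET_MS = 500           # hypothetical alert threshold

def probe(url: str) -> float:
    """Fetch the URL and return the elapsed time in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()               # download the body to measure the full fetch
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    try:
        elapsed = probe(TARGET)
        if elapsed > LATENCY_BUDGET_MS:
            print(f"ALERT: {TARGET} took {elapsed:.0f} ms (budget {LATENCY_BUDGET_MS} ms)")
        else:
            print(f"OK: {TARGET} responded in {elapsed:.0f} ms")
    except Exception as exc:       # a failed fetch is treated as an outage signal
        print(f"ALERT: probe to {TARGET} failed: {exc}")
```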

The posts I write are dual-purpose. The first aim is to present our technology and show how Kentik provides observability using several independent types of measurement. In this respect, the analysis of last month’s Azure outage is an excellent example, as it demonstrated our capabilities using NetFlow, BGP and synthetics monitoring.

The second is more personal. I’ve been writing about this area for around 12 years, having joined Kentik in 2020 after leading internet analysis at other companies, and I see it as my mission to compile all the data-supported observations one can make about an outage. We want to put our analysis out there so that when a company publishes its after-action report, the two can be combined to better inform the industry.

What types of visualisation are most valuable for diagnosing the root cause of internet outages, and what can things like NetFlow tell us about outages?

We do a lot of visualisations, but I think we’re best known for Sankey diagrams, a type of flow diagram that, in our case, illustrates how volumes of traffic traverse a network. Most of my work involves analysing time-series charts to track things like diurnal traffic patterns. In charts like these, it’s straightforward to see where disruptions occur.
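
The sketch below shows, in simplified form, how a disruption stands out against a diurnal baseline in a traffic time series: each hourly sample is compared with the median of the preceding hours and sharp drops are flagged. The data and the threshold are invented for illustration; this is not Kentik’s detection logic.

```python
import statistics

# Hypothetical hourly traffic samples in Gbps; the sharp drop mid-series
# stands in for an outage.
traffic_gbps = [40, 35, 30, 28, 30, 38, 55, 70, 85, 90, 88, 4, 3, 5, 80, 86]

WINDOW = 6            # hours of history used as the baseline
DROP_THRESHOLD = 0.5  # flag samples that fall below 50% of the recent median

for hour in range(WINDOW, len(traffic_gbps)):
    baseline = statistics.median(traffic_gbps[hour - WINDOW:hour])
    sample = traffic_gbps[hour]
    if sample < DROP_THRESHOLD * baseline:
        print(f"hour {hour}: {sample} Gbps vs baseline {baseline} Gbps -> possible disruption")
```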

Most of the traffic graphics I generate come from a tool called Kentik Data Explorer [KDE], our main way of interrogating NetFlow. Its versatility has made KDE a powerful tool, offering many dimensions that one can explore.

KDE is a flexible investigation tool that lets users ask an almost endless range of questions of their data. In fact, it’s not uncommon for our customers to come up with uses for KDE that we hadn’t even thought of.

What’s the current situation with regard to BGP, and how can routing security help to protect against outages?

BGP, as with many of the original protocols for the internet, has no security built into it. We’re therefore dealing with the challenges of how to bolt on security after the fact – and BGP in particular has been a long-running challenge to secure.

The internet community’s approach to routing security has coalesced around RPKI [resource public key infrastructure], a framework for validating BGP announcements between internet networks.

Using a service provided by the relevant regional internet registry, a network operator enables this by creating a route origin authorisation [ROA] that tells the world the proper origin for each of its routes. Other networks can then reject announced routes that don’t conform to the ROA, which reduces disruptions by limiting the propagation of errant BGP announcements.
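
The route origin validation logic that RPKI enables can be sketched in a few lines. The example below follows the approach described in RFC 6811: an announced route is ‘valid’ if a covering ROA authorises its origin ASN and prefix length, ‘invalid’ if it is covered but not authorised, and ‘not-found’ if no ROA covers it. The ROA table and the announcements here are hypothetical examples.

```python
from ipaddress import ip_network

# Each ROA authorises an origin ASN to announce a prefix up to a maximum length.
roas = [
    {"prefix": ip_network("203.0.113.0/24"), "max_length": 24, "origin_asn": 64500},
]

def validate(prefix_str: str, origin_asn: int) -> str:
    """Return 'valid', 'invalid' or 'not-found' for an announced route."""
    announced = ip_network(prefix_str)
    covering = [r for r in roas if announced.subnet_of(r["prefix"])]
    if not covering:
        return "not-found"   # no ROA covers this prefix
    for roa in covering:
        if announced.prefixlen <= roa["max_length"] and origin_asn == roa["origin_asn"]:
            return "valid"
    return "invalid"         # covered by a ROA, but origin or length doesn't match

print(validate("203.0.113.0/24", 64500))   # valid: matches the ROA
print(validate("203.0.113.0/24", 64999))   # invalid: wrong origin ASN
print(validate("198.51.100.0/24", 64500))  # not-found: no covering ROA
```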

I published a couple of pieces of analysis last year on where we are with RPKI deployment. For a long time, it was just a promising idea, with little progress to point to. However, in the last couple of years we’ve crossed a tipping point of adoption, both in the number of ROAs created and in the major networks now rejecting invalid routes. It’s heartening to see that we’ve finally made progress on this.

What do you hope your and Kentik’s work in this area of internet outages will achieve in the industry?

Kentik’s mission is to provide companies with greater observability of their networks and internet connectivity, leading to reduced downtime and better-performing systems. With the analysis I do, I like to educate – and perhaps even entertain. We believe publishing our outage analyses is vital to the internet community.

We don’t have all the internet’s traffic information, but we have a significant slice of how the internet looks at any given time. This gives me the ability to query our data and say, “country x is offline” or “Facebook is down.” I feel a certain obligation to use our unique data sets to help tell the story of the internet.

If someone thinks all of this makes Kentik a good company, that’s great; if it makes them want to buy something, that’s even better. But even if they don’t, we’ll have informed the discourse in the industry using our unique capabilities, providing a service to the community.
