The technical issues behind CenturyLink’s five-hour outage
01 September 2020 | Melanie Mingas
Analysis published over the last few hours may have solved the mystery of a widespread internet outage that saw global traffic drop by 3.5%.
The five-hour outage, which CenturyLink acknowledged in a series of tweets, was traced back to a misconfiguration in a data centre, but Thousand Eyes has now published a detailed analysis confirming “internal controlplane issues” and “unstable BGP” impacting CenturyLink’s service providers Level 3 and, in turn, Zayo Bandwidth.
Explaining what happened, Thousand Eyes said: “Traffic termination is certainly problematic, but what made this outage so disruptive to Level 3’s enterprise customers and peers, is that efforts to revoke announcements to Level 3 (a common method to reroute around outages and restore service reachability) were not effective, as Level 3 was not able to honour any BGP changes from peers during the incident, most likely due to an overwhelmed controlplane.
“Revoking the announcement of prefixes from Level 3, or preventing route propagation through a no-export community string and even shutting down an interface connection to the provider, would have been fruitless,” Thousand Eyes added.
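To illustrate why withdrawing routes did not help, the behaviour Thousand Eyes describes can be approximated with a toy model. This is a minimal sketch and not a BGP implementation: the `ToyProvider` class, its method names and the peer label `customer-1` are hypothetical, and `203.0.113.0/24` is a prefix reserved for documentation. It assumes only that a provider whose control plane is overwhelmed stops applying updates from peers, so routes it has already learned stay in place.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a transit provider's BGP speaker."""

    name: str
    control_plane_healthy: bool = True
    # Maps each prefix to the peer that announced it.
    routes: dict = field(default_factory=dict)

    def receive_announce(self, prefix: str, peer: str) -> bool:
        """Install a route, but only if updates can be processed."""
        if not self.control_plane_healthy:
            return False  # update is dropped, nothing changes
        self.routes[prefix] = peer
        return True

    def receive_withdraw(self, prefix: str, peer: str) -> bool:
        """Remove a route, but only if updates can be processed."""
        if not self.control_plane_healthy:
            return False  # stale route stays in the table
        if self.routes.get(prefix) == peer:
            del self.routes[prefix]
            return True
        return False

    def still_advertises(self, prefix: str) -> bool:
        """True while the provider holds (and so keeps attracting traffic for) the prefix."""
        return prefix in self.routes


provider = ToyProvider("provider-a")
provider.receive_announce("203.0.113.0/24", "customer-1")

# Assumption: the control plane becomes overwhelmed mid-incident.
provider.control_plane_healthy = False

withdrawn = provider.receive_withdraw("203.0.113.0/24", "customer-1")
print(withdrawn)                                    # False
print(provider.still_advertises("203.0.113.0/24"))  # True
```

In this simplified picture the customer has done everything right, yet the provider keeps the stale route and continues to draw traffic towards a path that no longer works, which is the situation the analysis describes.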
Reportedly causing global internet traffic to drop by 3.5%, the outage took out the likes of AWS, Xbox Live, EA, Reddit, and Cloudflare.
In its analysis, Thousand Eyes concluded: “The dynamic, uncontrolled (and contextual) nature of internet routing was on full display during this incident, underscoring the significant impact of peering and provider choices — not only your own, but those of your peers and their peers. The deeply interconnected and interdependent nature of the internet means that no enterprise is an island — every enterprise is a part of the greater Internet whole, and subject to its collective issues. Understanding your risk factors requires an understanding of who is in your wider circle of dependencies and how their performance and availability could impact your business if something were to go wrong.
“In the case of this CenturyLink/Level 3 incident, the timing of it dramatically reduced its impact on many businesses, as it occurred in the early hours (at least in the U.S.) on a Sunday morning. Perhaps, that’s one bit of good news we can take away from this incident. And we all could use some good news about now.”
Another outage in April saw CenturyLink’s Level 3 network suffer a “major disruption” due to fibre cuts, which impacted users and businesses throughout multiple regions of the United States.