Amazon Web Services’ cloud platform in 'fat finger' outage

March 03, 2017 06:00 AM

Amazon Web Services had a major five-hour outage this week that affected thousands of corporate customers.

The problems started in the US-EAST-1 region, hosted in data centres in Northern Virginia. The incident hit thirty-three of AWS’s services, nine of which failed completely: Athena, EMR, Inspector, Kinesis Firehose, Simple Email Service, S3, WorkMail, Auto Scaling and CloudFormation.

As a result, many hundreds of thousands of cloud-based applications and websites were forced offline.

Amazon has issued a statement saying that an incorrectly typed command, entered during routine debugging of its billing system, caused the outage, which it said lasted five hours on Tuesday.

AWS says that a command meant to remove a small number of servers from one of its S3 subsystems was entered incorrectly, and a much larger tranche of servers was removed instead. This required a full restart of all affected servers, which took longer than expected.

Amazon says it is making changes to its systems to make sure similar ‘fat finger’ mistakes cannot happen again.
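The article does not describe what those changes are. One common kind of safeguard against this class of mistake is a check that refuses any removal request that is too large or that would leave too few servers running. The following is a minimal sketch of that idea only, not Amazon's tooling: the function name, the parameters and the thresholds are all hypothetical.

```python
class UnsafeRemovalError(Exception):
    """Raised when a removal request would cut capacity too far."""


def check_removal(active: int, requested: int, min_required: int, max_batch: int) -> int:
    """Return the number of servers left after a removal, or raise if it is unsafe.

    All names and limits here are hypothetical, chosen for illustration.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Guard 1: cap how many servers a single command may remove.
    if requested > max_batch:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} servers; limit per command is {max_batch}"
        )

    # Guard 2: never drop below the capacity the subsystem needs to keep running.
    remaining = active - requested
    if remaining < min_required:
        raise UnsafeRemovalError(
            f"removal would leave {remaining} servers; minimum required is {min_required}"
        )

    return remaining
```

With 100 servers running, a minimum of 80 and a per-command limit of 10, a request to remove 5 servers passes and leaves 95, while a mistyped request for 50 is rejected before anything is taken offline.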

The incident, however, caused chaos for users and leaves many questions to be answered in the new world of mass cloud adoption. Up to Tuesday, bean-counters worldwide had been heralding cloud as the best way to get lean and chuck out that pesky, expensive IT department. On Wednesday they would have been weeping into their skinny lattes as they started to count up the cost of at least five hours during which their revenue-generating websites were down.

In an early statement during the outage, AWS said: “We have identified the issue as high error rates with S3 in US-EAST-1, which is also impacting applications and services dependent on S3. We are actively working on remediating the issue.”
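For applications that depend on a single storage region, one general mitigation is to keep a copy of the data elsewhere and fall back to it when the primary returns errors. The sketch below illustrates the pattern with stand-in functions rather than real AWS calls; `primary_region`, `replica_region` and `fetch_with_fallback` are hypothetical names, and nothing here is drawn from AWS's statement.

```python
from typing import Callable, Sequence


def fetch_with_fallback(key: str, fetchers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each storage backend in order and return the first successful read."""
    errors = []
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(fetchers)} backends failed for {key!r}: {errors}")


# Stand-ins for real storage clients (hypothetical, for illustration only).
def primary_region(key: str) -> bytes:
    raise ConnectionError("high error rate in primary region")


def replica_region(key: str) -> bytes:
    return b"copy of " + key.encode()


data = fetch_with_fallback("logo.png", [primary_region, replica_region])
```

The trade-off is cost and complexity: the replica has to be paid for and kept in sync, which is why many sites affected by an outage like this one run from a single region.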

According to AWS, the outage originated in the Simple Storage Service (S3), a component of the AWS platform on which many of its other cloud-based products depend.

S3 is AWS’s object storage service, which Amazon says runs on the same infrastructure it uses for its own global network of websites. It stores and retrieves customers’ cloud data.

Affected organisations included Business Insider, Expedia, Coursera, Quora, and Slack.