Amazon EC2 and Sony PSN Failures Highlight Need for Education
The outage of the Amazon Elastic Compute Cloud (EC2) two weeks ago left businesses scrambling, and consumers got a similar jolt from last week's data breach of Sony's PlayStation Network (PSN). Between the two incidents, the cloud's potential for disruption was laid bare to consumers and businesses alike, along with the realization that reliance on the cloud must come not with blind faith but with a solid understanding of the vanilla web issues of performance, reliability, scalability, and security.
The issues surrounding each incident—Amazon lost a number of hard drives (volumes) that had a ripple effect on many other computers in a single data center, while Sony had a security breach that netted millions of passwords and thousands of credit cards—are important to understand.
Sony's issue is as much a public relations problem as a security breach: the company withheld information about the exposure of passwords, credit card numbers, and other personal information for several days. Since then, according to the company, it has been "rebuilding the network by hand" in terms of its data storage and security features. Separate services, such as Netflix viewing on the PlayStation 3, weren't affected by the PSN breach and subsequent outage, since they are delivered from different cloud-based services, such as Amazon's EC2.
Amazon's EC2 issue was a bit more complicated, and goes to the heart of reliance on heavy-computing cloud-based services. One issue, raised by what one pundit dubbed the "cloud hater" crowd, challenges the premise that cloud computing is better designed for scalability and redundancy than the average enterprise server farm.
Misleading Cloud Marketing
A big part of the blame, according to "cloud hater" logic, is the marketing and sales approach to the cloud. One of the stronger sales pitches for the cloud is the fact that data is reliably kept in the cloud—with no need for localized backups—ready to be accessed at any time. It's almost an "upload and forget" approach that says a customer's data is ready and waiting whenever they need it, and that customers shouldn't really worry about the inner workings of the cloud.
The objection to this argument has often centered on the claim that network outages and intermittent connectivity are consumer issues, and that few businesses face complete downtime on their internet connections. Yet the FCC's broadband studies tell a story of uneven and, at times, unavailable connectivity. Few consumers live in an always-on world; most go through frequent periods of intermittent connectivity, and both virtual and rural businesses face similarly inconsistent connections.
Within the streaming world we've always prepared content to be delivered across intermittent networks, from the early days of true RTP streaming to the more recent HTTP delivery of MPEG-4 segments or adaptive bitrate video.
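That preparation boils down to a simple feedback loop: measure what the network actually delivered for the last segment, then request the next segment at a bitrate rung the connection can sustain. A minimal sketch of that adaptive-bitrate logic in Python (the bitrate ladder and safety margin below are illustrative, not taken from any real player):

```python
def pick_rendition(measured_kbps, renditions=(400, 800, 1500, 3000), safety=0.8):
    """Pick the highest bitrate rung that fits the measured throughput.

    A toy version of adaptive bitrate selection: the player measures
    download throughput per segment and switches rungs so playback
    survives an intermittent network. `renditions` is a hypothetical
    bitrate ladder in kbps; `safety` leaves headroom for variance.
    """
    usable = measured_kbps * safety
    viable = [r for r in renditions if r <= usable]
    # Fall back to the lowest rung rather than stalling entirely.
    return max(viable) if viable else min(renditions)
```

A player calling this after every segment download will step down when throughput drops and climb back up as the connection recovers, which is exactly what makes HTTP segment delivery tolerant of intermittent networks.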
What hadn't been discussed at length, at least not until the Amazon EC2 outage, was the intermittent availability of the data center and the intentional lack of redundancy across data centers.
Given the marketing around the cloud, one could easily assume that the "upload and forget" model came with built-in redundancy for every EC2 customer. Amazon's response shatters that myth, although the company is working hard to fix the issue.
The Amazon EC2 Outage: What Happened
EC2 itself didn't go down completely, but the impact on stored website data could be seen across many websites, because the issue, according to Amazon:
"primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone [e.g., data center] within the U.S. East Region that became unable to service read and write operations."
The culprit wasn't the nodes themselves, but instead a network equipment upgrade Amazon says was
"performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region."
The incapacitated volumes caused a ripple effect throughout the EC2 infrastructure, as each affected node (and cluster of nodes) searched for other nodes with enough storage space to replicate its data. While content is being replicated, access to it is locked. Amazon said in its post-mortem report that
"Nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly. There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication."
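The behavior Amazon describes as missing, backing off "aggressively enough," is the classic exponential backoff with jitter. A minimal sketch of that technique, with `attempt_replication` standing in for an EBS node's search for a peer with free space (the real EBS code is not public, so everything here is an illustrative assumption):

```python
import random
import time

def find_replica_node(attempt_replication, max_attempts=8, base_delay=0.5):
    """Retry a replication request with exponential backoff and jitter.

    `attempt_replication` is a hypothetical callable returning True once
    a peer with free space is found. Instead of retrying in a tight loop,
    each failure doubles the maximum wait, and random jitter keeps
    thousands of nodes from retrying in lockstep and flooding the network.
    """
    for attempt in range(max_attempts):
        if attempt_replication():
            return True
        delay = base_delay * (2 ** attempt)
        time.sleep(random.uniform(0, delay))
    return False  # give up rather than search forever
```

Capping the attempts matters as much as the delays: a node that eventually gives up and reports failure is far less damaging than a fleet of nodes searching repeatedly, which is the "re-mirroring storm" at the heart of the outage.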
As with any redundant system, one would assume the content was being stored off-site at multiple locations, a common practice in enterprise server solutions. Yet, for all the cloud marketing, redundancy across multiple locations, or Availability Zones, wasn't necessarily in place for EC2, since Amazon charges more for storage across multiple Availability Zones.
In its report, the company seems to lay some of the blame on customers for not choosing the multiple-zone option, or not writing applications to take advantage of these multiple zones.
Still, if the marketing about redundancy and reliability is to be believed, customers shouldn't have needed to understand or work across multiple Availability Zones.
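For the customers who do architect across zones, the idea reduces to making every write land in more than one Availability Zone, so that losing a zone leaves a live copy elsewhere. A hypothetical sketch, not Amazon's API (the zone names and the dict-like store interface are assumptions for illustration):

```python
class ZoneOutage(Exception):
    """Raised by a store whose Availability Zone is unreachable."""

def replicated_write(key, data, zone_stores, min_copies=2):
    """Write `data` under `key` to stores in multiple Availability Zones.

    `zone_stores` maps a (hypothetical) zone name to a dict-like store.
    The write counts as successful only if at least `min_copies` zones
    accept it, so a single-zone failure never silently loses data.
    """
    written = 0
    for zone, store in zone_stores.items():
        try:
            store[key] = data
            written += 1
        except ZoneOutage:
            continue  # a downed zone must not sink the whole write
    if written < min_copies:
        raise RuntimeError("only %d of %d copies written" % (written, min_copies))
    return written
```

The point of the sketch is that this logic lived in the customer's application, not in EC2 itself, which is exactly the burden Amazon's post-outage changes aim to lift.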
The Amazon outage came quickly on the heels of a new-media / broadcast roundtable discussion I moderated in Las Vegas during the recent National Association of Broadcasters show. The roundtable was sponsored by Microsoft, iStreamPlanet, and Interxion, the latter being a data center facilities provider that uses a two-data-center-per-city approach to blanket European cities.
One of the issues raised at the roundtable was that of cloud reliability. Even prior to the Amazon outage, there were questions raised about speed of transport and reliability of cloud services for mission-critical applications. One participant even quipped that, while they relied on their technology partners to recommend tried-and-tested solutions, an issue with the cloud was establishing liability in the event of a cloud outage.
"We can't sue the cloud," the participant quipped.
In the Amazon instance, however, the company understands the impact that outages have on customers. While refunding money for a few days of outage isn't going to bring back the lost revenues many of the companies faced, it does appear Amazon will relax its policy around charging extra for storing data in multiple Availability Zones.
Amazon also has its work cut out for it, both to educate potential EC2 customers and to correct and expand its software, and admitted as much when it announced a series of webinars:
"The first topics we will cover will be Designing Fault-tolerant Applications, Architecting for the Cloud, and Web Hosting Best Practices. The webinars over the next two weeks will be hosted several times daily to support our customers around the world in multiple time zones. We will set aside a significant portion of the webinars for detailed Q&A. Follow-up discussions for customers or partners will also be arranged."
In addition to the webinars, Amazon is making available whitepapers on AWS architecting best practices, and will also modify its services to allow multi-zone balancing automatically, without customer intervention.
In other words, Amazon looks to move beyond the outage with a series of action items to automate recovery and redundancy in the cloud in a way that most enterprise customers have been used to for years.
Rather reminds one of a variation on the old nursery rhyme: when it works, it is very, very good, but when it doesn't, it is awful.