Last week's failure of Amazon Web Services has led many pundits to question the validity of the cloud model.
Anyone who believes the AWS outage means that companies are better served by managing their own data centers has probably never managed one.
Those who have managed data centers know that failures occur, and for many reasons - hardware failures, rack outages, power failures, software maintenance failures and network failures, which in the worst case occur among Tier 1 networks or on the IP backbone.
For companies using traditional data centers to host their applications, the solution is to have a strong redundancy plan. A solid redundancy plan generally involves using two or more data centers, often in different geographic regions and connected to different backbone providers.
So why is it that when many companies move to the cloud, they ignore that approach and use simple zone redundancy (if that) from Amazon?
One company that withstood last week's AWS outages with virtually no impact was Netflix. And that's not surprising. Last December, Netflix posted about their approach to using AWS and "designing for failure". Among other things, the Netflix system automatically adjusts what it displays to users based on which systems are available. If the recommendation system is down, it shows popular titles rather than personalized suggestions. While Netflix uses AWS, they have not ceded responsibility for planning to others.
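The fallback idea is easy to sketch. The Python below is only an illustration of the pattern, not Netflix's code: the `ServiceUnavailable` exception, the `POPULAR_TITLES` list and the two stand-in recommender functions are all made-up names for the example.

```python
from typing import Callable, List

# Hypothetical fallback list; a real system would likely serve a cached ranking.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) backend client when its service is down."""


def titles_for_user(user_id: str,
                    fetch_personalized: Callable[[str], List[str]]) -> List[str]:
    """Return personalized titles, or popular titles if the recommender is down."""
    try:
        return fetch_personalized(user_id)
    except ServiceUnavailable:
        # Degrade gracefully: the page still renders, just without personalization.
        return POPULAR_TITLES


def healthy_recommender(user_id: str) -> List[str]:
    """Stand-in for a working recommendation service."""
    return [f"Pick for {user_id}"]


def broken_recommender(user_id: str) -> List[str]:
    """Stand-in for a recommendation service that is currently down."""
    raise ServiceUnavailable("recommendation service is down")
```

Calling `titles_for_user("u1", broken_recommender)` returns the popular list instead of raising an error, so the user still sees a working page.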
Amazon and other major cloud hosts offer regional redundancy. Used properly, this lets you shift your traffic automatically to a region that is still up when another region goes down (as happened last week). Had Quora, Foursquare and others used that approach, the AWS failure would not have had a serious impact on them.
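At its core, regional failover is a simple decision: send traffic to the first region that passes a health check. Here is a minimal Python sketch of that decision only. The region labels and the `is_healthy` callback are placeholders invented for the example, and in practice the switch is usually made at the DNS or load-balancing layer rather than in application code.

```python
from typing import Callable, Optional, Sequence

# Placeholder region labels in order of preference (not real provider names).
REGIONS = ["region-a", "region-b"]


def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region that passes its health check, or None if all fail."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Example: simulate an outage in the preferred region.
down = {"region-a"}
active = pick_region(REGIONS, lambda region: region not in down)
```

With `region-a` marked as down, `active` ends up as `region-b`. If every region fails its check, the function returns `None`, and the caller has to decide what to show users.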
Now, that's not to absolve Amazon of what was a major failure last week in their Northern Virginia AWS data center. A multi-day failure at a major cloud data center is not acceptable. Yet neither is absolving yourself of the responsibility to build an operations plan that can withstand major outages.
Even after the major failure at Amazon last week, the cloud offers the same benefits that it did previously in terms of scalability, hardware and software management, physical (hardware) redundancy and more. For almost any web-centric business, the cloud is the most effective way to deploy. But that doesn't get us off the hook for planning around failure. It's still up to every company to understand the various levels of redundancy offered by its host and choose accordingly.