In my capacity as PDMS' marketing representative in London, I recently attended a number of seminars on cloud computing at the Cloud Expo 2012.

The most memorable and beneficial session, for me, was one which discussed the outages Amazon Web Services (AWS) suffered last year...

Amazon Web Services (AWS) is a collection of remote computing services which together make up a cloud computing platform. This platform is used by countless household names, including The Guardian and Washington Post newspapers, TweetDeck, Foursquare, Ticketmaster and, of course, the Amazon websites.

The AWS cloud platform offers businesses 'instant elasticity': open, flexible and secure hosting, and, crucially, 'pay-as-you-go' pricing with no long-term commitments.  It can be a remarkably economical way of getting things done, with customers paying only for what they use, and without being tied into a contract.  In practice, this means businesses can move their information from data centres geographically near their premises to cloud 'regions' anywhere in the world, which can provide a better service at a fraction of the cost.  Cloud computing is, they say, a 'game changer' and the most powerful computing development to become mainstream in a generation.  And that's all great, of course, until it goes wrong.

In 2011 AWS suffered a series of outages, one lasting five days.  The outages affected thousands of businesses, some of which lost revenue as a result.  Some developers saw them as a warning of what happens when we rely too heavily on the cloud.  The real failures, however, don't just belong to AWS, but also to the sites that use it.

The problem for sites that were brought down by the AWS outage is their own failure to implement one of the key design principles of the cloud: to design with failure in mind.  AWS explicitly advised developers to design a site's architecture so that it was resilient to occasional failures and outages.  This sound advice was ignored by many.

As one blogger pointed out at the time, "In short, if your systems failed in the Amazon cloud… it wasn't Amazon's fault.  You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider."
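Designing with failure in mind usually comes down to redundancy and graceful failover: if one availability zone disappears, the application should quietly try another. As a rough illustration (a minimal Python sketch; the zone endpoints here are hypothetical stand-ins, not real AWS calls), a client that can fall back to a second zone survives the loss of its first:

```python
def fetch_with_failover(fetchers):
    """Design-for-failure sketch: try each zone's fetcher in turn and
    return the first successful result.  The caller keeps working as
    long as at least one zone is reachable."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch()
        except ConnectionError as err:
            last_error = err  # this zone is down; try the next one
    raise RuntimeError("all zones failed") from last_error

# Hypothetical zone endpoints, standing in for real service calls:
def zone_a():
    raise ConnectionError("us-east-1a unreachable")

def zone_b():
    return "response from us-east-1b"

print(fetch_with_failover([zone_a, zone_b]))  # -> response from us-east-1b
```

The point is that the failover logic lives in the application, not in AWS: the sites that went down had, in effect, written only `zone_a()`.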

Even Amazon customers affected by the outages preface their criticism with praise for AWS and what it has enabled them to do with their business, such as BigDoor's CEO Keith Smith:

"AWS has allowed us to scale a complex system quickly, and extremely cost effectively. At any given point in time, we have 12 database servers, 45 app servers, six static servers and six analytics servers up and running. Our systems auto-scale when traffic or processing requirements spike, and auto-shrink when not needed in order to conserve dollars."
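The auto-scale/auto-shrink behaviour Smith describes can be pictured with a toy capacity policy (a hypothetical threshold rule for illustration only, not the actual AWS Auto Scaling API): run just enough servers for the current load, clamped between a redundancy floor and a cost ceiling.

```python
import math

def desired_capacity(total_load, per_server_capacity,
                     min_servers=2, max_servers=50):
    """Toy scaling policy: enough servers for the current load,
    clamped between a floor (for redundancy) and a cost ceiling."""
    needed = math.ceil(total_load / per_server_capacity)
    return max(min_servers, min(max_servers, needed))

print(desired_capacity(900, 100))  # traffic spike -> 9 servers
print(desired_capacity(50, 100))   # quiet period -> shrinks to the floor of 2
```

Running such a rule on a schedule is what lets a fleet grow during a spike and "auto-shrink when not needed in order to conserve dollars".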

The seminar focused on a case study of one company using AWS that was unaffected by the outages, and the reasons why it, unlike so many others, survived unscathed.

Netflix, the TV and movie streaming company, has been using AWS for two years and, as part of its cloud deployment strategy, has been developing tools that 'kill off' instances and services within its architecture at random, alongside the application itself.

While one department works on improving the recommendation engine, or the payment system, another department works on a tool designed to break it. Netflix's objective is to design a cloud architecture, where individual components can fail without affecting the availability of the entire system.  As John Ciancutti, VP of Personalisation Technology at Netflix explains it, "If we aren't constantly testing our ability to succeed despite failure, then it isn't likely to work when it matters most - in the event of an unexpected outage."

The original tool was called the 'Chaos Monkey'.  There are now seven 'monkeys' in total, each performing a different, but ultimately destructive, job.  Their goal is to induce various kinds of failures, detect abnormal conditions, and test the system's ability to survive them:

  • Chaos Monkey - randomly shuts down a part of the Netflix architecture to see if it still holds up
  • Latency Monkey - simulates service degradation by inducing artificial delays
  • Conformity Monkey - shuts down instances that don't comply with best-practices
  • Doctor Monkey - taps into health checks that run on each instance, as well as monitoring other external signs of health (e.g. CPU load), to detect unhealthy instances
  • Security Monkey - searches for security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances
  • 10-18 Monkey - (Localization-Internationalization, or l10n-i18n) detects configuration problems in instances serving customers in multiple geographic regions, using different languages and character sets
  • Chaos Gorilla - is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone
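The idea behind the original Chaos Monkey can be shown with a toy simulation (a sketch only; the cluster names and the survival rule are invented for illustration, and this is nothing like Netflix's real tooling): a redundant cluster should stay available even after a randomly chosen instance is terminated.

```python
import random

class Cluster:
    """Toy redundant cluster: the service counts as 'up' while at
    least one instance survives."""
    def __init__(self, instances):
        self.instances = set(instances)

    def is_available(self):
        return bool(self.instances)

def chaos_monkey(cluster, rng):
    """Terminate one randomly chosen instance, Chaos Monkey style.
    Returns the victim's name, or None if nothing is left to kill."""
    if not cluster.instances:
        return None
    victim = rng.choice(sorted(cluster.instances))
    cluster.instances.discard(victim)
    return victim

rng = random.Random(2012)
cluster = Cluster(["app-1", "app-2", "app-3"])
victim = chaos_monkey(cluster, rng)
print(f"killed {victim}; service still up: {cluster.is_available()}")
```

Run continuously against production, a check like `is_available()` after each random kill is exactly the "constantly testing our ability to succeed despite failure" that Ciancutti describes.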

Perhaps if AWS had had their own 'monkey' system, their outages wouldn't have occurred, or at least wouldn't have been quite so bad.  This isn't to say that Netflix never goes down; it does.  And after making so much of their 'monkeys', this potentially embarrassing situation has been blamed on their legacy systems - which naturally don't have monkeys of their own!