The Amazon AWS outage on April 21, 2011 took many people by surprise. After all, simply having your servers on “The Cloud” solves all of your problems, right? The outage that brought down Quora, Reddit, FourSquare, and even paper.li was an eye opener for many people. They bought into the hype of “The Cloud” without fully understanding its limitations and the burdens of running your own servers. And make no mistake, using Amazon EC2 as a host is in many ways the same as maintaining your own dedicated server. Because of this, having your sites on AWS can actually increase your risk compared to choosing a managed solution.
Information is still coming in about yesterday’s “networking event”, but from what I’ve read so far, a software bug caused the northern Virginia data center (US-EAST-1) to make many extra copies of the hard drive images (EBS volumes). This caused all of Amazon’s vast resources there to be used up. In short, they DoS’d themselves. As EC2 customers saw that their websites were having issues, they logged in to the Amazon AWS control panel and proceeded to restart their servers (instances). This influx of server reboots increased the load further. It got to the point where the instances could not even be stopped correctly. After that, it was just a free-for-all as the anger vented from users to the site owners to the server admins to the Amazon forums to Twitter, Facebook and basically all social media sites not brought down by the outage.
WebDevStudios has a number of servers in the US-EAST-1 region. We were fortunate enough not to be affected at all by this outage. Now, when a provider’s services break down at the system level, you are somewhat at the mercy of circumstances. However, some websites were able to handle this outage better than others. Here are 5 tips that can help you build a robust Amazon Web Services cluster.
1. Don’t Panic
This is always the first rule. If you have planned for an outage, you have no reason to panic. You have tools at your fingertips to take the actions you need in order to rebuild your site. Also remember that, as much as downtime sucks, it happens to everyone. If you are up 99.9% of the year, that still allows for more than 8 hours of downtime. Amazon does one better and promises to refund part of your fees if you are not available 99.95% of the time, but that still allows for more than 4 hours of downtime. So, even as your downtime increases, stop, take a breath and plan your actions. Don’t just go out and flip switches (and stop instances) because you have to do something.
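The downtime figures above follow from simple arithmetic. Here is a minimal sketch of that calculation; the function name is just an illustration, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service can be down and still meet the given availability."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


for pct in (99.9, 99.95):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

That works out to roughly 8.76 hours at 99.9% and 4.38 hours at 99.95%.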
2. Don’t put all your eggs in one basket
Elastic Load Balancers (ELB) are a service provided by Amazon. They serve as the entry point to your site. They also add a layer of indirection between your DNS and the instances delivering your site. This means that if you have an instance that is not working correctly, you can quickly remove it or add another one. Some of the sites that recovered very quickly from the outage were able to use their load balancers to point at instances in other data centers and to remove the affected servers from the mix.
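As a rough sketch of what "removing the affected servers from the mix" can look like in code, the function below asks a load balancer which instances are out of service and deregisters them. It assumes you are using the boto3 library and pass in a classic ELB client; the function name and the load balancer name are hypothetical.

```python
def remove_unhealthy_instances(elb_client, load_balancer_name):
    """Deregister every instance the load balancer reports as out of service.

    elb_client is assumed to be a boto3 classic ELB client, for example:
        elb_client = boto3.client("elb", region_name="us-east-1")
    Returns the list of instance IDs that were removed.
    """
    health = elb_client.describe_instance_health(
        LoadBalancerName=load_balancer_name
    )
    unhealthy = [
        state["InstanceId"]
        for state in health["InstanceStates"]
        if state["State"] == "OutOfService"
    ]
    if unhealthy:
        elb_client.deregister_instances_from_load_balancer(
            LoadBalancerName=load_balancer_name,
            Instances=[{"InstanceId": instance_id} for instance_id in unhealthy],
        )
    return unhealthy
```

In a real outage you would want healthy replacements registered before removing anything, otherwise you can end up with an empty load balancer.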
3. In fact, always carry several baskets
Amazon’s fully managed MySQL offering is called the Relational Database Service (RDS). RDS makes setting up a database cluster much simpler. It provides the ability to have a master server and many read replicas. This not only makes your site faster but also provides backups for when a server goes down. RDS also has an advanced feature called Multiple Availability Zones (Multi-AZ). In short, you always have a clone of the database server running in a physically separated area. Part of what made this incident bad is that Amazon’s system was so overwhelmed that it was very slow to switch over to the alternate Availability Zone. Sites whose standby was in another AZ that was also affected felt no relief. All that aside, Multi-AZ allows you to do server maintenance and reboots with little or no downtime, and it does usually help in the event of outages.
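Turning on the multiple Availability Zone option for an existing database is a small change. Here is a minimal sketch, assuming boto3 and an RDS client passed in by the caller; the function name and the database identifier in the comment are hypothetical.

```python
def enable_multi_az(rds_client, db_instance_id, apply_immediately=False):
    """Ask RDS to keep a standby copy of the database in a second Availability Zone.

    rds_client is assumed to be a boto3 RDS client, for example:
        rds_client = boto3.client("rds", region_name="us-east-1")
        enable_multi_az(rds_client, "my-site-db")
    With apply_immediately=False the change waits for the next maintenance window.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        MultiAZ=True,
        ApplyImmediately=apply_immediately,
    )
```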
4. Have extra eggs
BACKUPS, BACKUPS, BACKUPS.
For AWS, they are called Snapshots. You can very quickly make a backup of each server hard drive. Always be ready to rebuild your machine in a matter of minutes if it goes down for any reason. If all of your source code is in version control on a service like GitHub, you don’t have to worry about that. If your uploads live in a storage service like Amazon S3, you don’t have to worry about that either. Your snapshots then only need to be as current as the last major server change. Then when any type of server in your cluster goes down, or if you need another one to keep up with current load, using a snapshot to stand up a new machine can turn a very bad situation into no big deal.
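Taking snapshots is easy to script, so it can run after every major server change. The following is a minimal sketch, assuming boto3 and an EC2 client supplied by the caller; the function name and the volume IDs you pass in are your own.

```python
from datetime import datetime, timezone


def snapshot_volumes(ec2_client, volume_ids, reason):
    """Start a snapshot of each EBS volume and return the new snapshot IDs.

    ec2_client is assumed to be a boto3 EC2 client, for example:
        ec2_client = boto3.client("ec2", region_name="us-east-1")
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"{reason} ({stamp})",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

The call returns as soon as the snapshot is started, so a script that depends on the result should wait for the snapshot to reach the completed state.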
5. Be ready to take your baskets and run
The other reason to have current snapshots on hand is the ability to bring up new servers in a completely different data center. This is what many sites did to recover. Since Virginia wasn’t working for them, they moved all of their servers to California. If you’ve planned for it, moving your hosting across the country can be a (relatively) painless task completed (relatively) quickly.
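Part of planning for a move like this is keeping copies of your snapshots in the region you would run to. Here is a minimal sketch, assuming boto3 and an EC2 client that was created for the destination region; the function name is hypothetical.

```python
def copy_snapshot_to_region(dest_ec2_client, snapshot_id, source_region):
    """Copy an EBS snapshot into the region that dest_ec2_client points at.

    dest_ec2_client is assumed to be a boto3 EC2 client for the destination,
    for example Northern California:
        dest_ec2_client = boto3.client("ec2", region_name="us-west-1")
        copy_snapshot_to_region(dest_ec2_client, snapshot_id, "us-east-1")
    Returns the ID of the new snapshot in the destination region.
    """
    response = dest_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"Copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]
```

Copies made ahead of time are the ones that help, since copying out of a region that is already struggling may be slow or fail.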
Above all, remember: Omelets happen.
With extremely complex systems that have many layers of responsibility, there will be failures, and issues will occur that are out of your hands. However, if you have thought through and diligently applied your emergency plans, you can take most situations in stride.
If you would like to talk to WebDevStudios about how we can create a robust hosting solution for you, please use the contact menu above.