Viktors Rotanovs home

Understanding Amazon EC2 failures and redundancy

I’ve been looking after several websites on Amazon EC2, and I’d like to share some thoughts about how Amazon EC2 can fail.

Shared servers and avoiding SPOF

Only two instance types run on dedicated servers: c1.xlarge and m2.4xlarge. Other instance types use shared servers.

If you start multiple instances for redundancy, there’s a good chance that some of them will land on the same physical server and so provide no redundancy against a host failure. For example, I’ve just tried launching eight instances, and three of them landed on the same physical server (as confirmed by traceroute).
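A check like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes that two instances reporting the same first traceroute hop share a physical host, which the post does not spell out. The instance names and addresses are made up, and `group_by_first_hop` is a hypothetical helper, not part of any EC2 tool.

```python
from collections import defaultdict


def group_by_first_hop(first_hops):
    """Return {first_hop: [instances]} for hops shared by 2+ instances.

    `first_hops` maps an instance name to the first hop address seen in
    a traceroute run from that instance. Treating a shared first hop as
    a shared physical host is an assumption, not a documented guarantee.
    """
    groups = defaultdict(list)
    for instance, hop in first_hops.items():
        groups[hop].append(instance)
    # Keep only hops that more than one instance reported.
    return {hop: sorted(names) for hop, names in groups.items() if len(names) > 1}


# Made-up sample data: three of the four instances share a first hop.
sample = {
    "web-1": "10.0.0.1",
    "web-2": "10.0.0.1",
    "web-3": "10.0.7.1",
    "web-4": "10.0.0.1",
}
colocated = group_by_first_hop(sample)
print(colocated)
```

Any non-empty result means that at least two instances are probably not independent of each other.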

Therefore, to avoid a single point of failure, you’ll most likely need to start instances in more than one Availability Zone.
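One simple way to plan such a launch is to assign instances to zones round-robin and then check that the result spans at least two zones. The sketch below only computes the placement and makes no EC2 API calls. The zone names are examples, and `spread_across_zones` and `survives_zone_loss` are hypothetical helpers introduced for illustration.

```python
from collections import Counter
from itertools import cycle, islice


def spread_across_zones(count, zones):
    """Assign `count` instances to `zones` in round-robin order."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if not zones:
        raise ValueError("at least one zone is required")
    return list(islice(cycle(zones), count))


def survives_zone_loss(placement):
    """True if at least one instance remains after losing any one zone."""
    return len(set(placement)) >= 2


# Example zone names; substitute the zones available to your account.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = spread_across_zones(5, zones)
print(Counter(placement))
print(survives_zone_loss(placement))
```

Each entry in `placement` would then be passed as the zone for one launch request. Spreading across zones still does not stop two instances in the same zone from sharing a host, but losing a single host or a single zone no longer takes down everything.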

EBS durability

EBS does not provide real durability because of writeback caches. Some people claim that’s an advantage.

Replicating data to a second EBS volume in the same Availability Zone will not prevent disasters:

The only solution is to replicate data to another Availability Zone.

The devil is in the details

Having a perfectly running system today does not guarantee you’ll be able to run it tomorrow.

In case of a forced server shutdown by Amazon, a hardware failure, or a stop and start of an EBS root instance, the following may happen:

There is no guarantee that your OS will support these changes.

Also, some Linux kernel bugs show up only when running on EC2.

Conclusion

Creating reliable systems on EC2 can be hard, because critical services may need to span multiple Availability Zones. While most of the time systems will run fine, extensive sysadmin skills may be required when things go wrong.

P.S.

Here are two common setup mistakes, given the factors above:

Do you know more examples?
