Relationship between AWS EBS snapshots and EBS volume failure rates

I’ve been doing some research lately on best practices for using AWS EBS volumes with data redundancy in mind – in particular, on the relationship between combining EBS volumes into a software RAID array at the OS level (either RAID 1 or RAID 10) and the proper use of EBS snapshots.

On the AWS EBS details page (http://aws.amazon.com/ebs/details/), in the section on “Amazon EBS Availability and Durability”, we find the following:

Amazon EBS volumes are designed to be highly available and reliable. At no additional charge to you, Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. For more details, see the Amazon EC2 and EBS Service Level Agreement.

The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS Snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.
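To put those quoted figures in perspective, here is a quick back-of-envelope comparison in Python. The AFR values are the ones from the quote above; the multi-year horizon is just my own illustrative assumption:

# AFR figures taken from the AWS quote above; the 3-year horizon is
# an illustrative assumption of mine.
ebs_afr_low, ebs_afr_high = 0.001, 0.005   # EBS volume with a recent snapshot
commodity_afr = 0.04                       # typical commodity hard disk

print(f"EBS is roughly {commodity_afr / ebs_afr_high:.0f}x to "
      f"{commodity_afr / ebs_afr_low:.0f}x less likely to fail per year")

years = 3
for label, afr in [("EBS (worst case)", ebs_afr_high), ("commodity disk", commodity_afr)]:
    print(f"{label}: {(1 - afr) ** years:.1%} chance of surviving "
          f"{years} years without a complete loss")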

The claim that really puzzled me for a while was that EBS volumes are more durable (i.e. have a lower failure rate) when EBS snapshots of them are taken more often. The two just don’t seem related … at least not when we think about drives and backups in a non-cloud, non-redundant way. For a while I thought this was just some strange marketing statement by AWS – almost a reverse Murphy’s law: the more backups/snapshots you take, the lower the chance that your drive will fail. I understand that more frequent EBS snapshots would let me restore more recent versions of a volume’s data, but how exactly could taking snapshots affect the rate at which a volume physically fails?

I eventually came across two web pages that cleared up the mystery for me:

https://forums.aws.amazon.com/message.jspa?messageID=124224#124224
http://www.quora.com/What-is-the-annual-failure-rate-for-1TB-Amazon-EBS-storage

So here’s the explanation:

From the quote above we know that Amazon does something behind the scenes to protect an EBS volume against the failure of a single physical drive. Some sort of data replication is involved, where the data for a single EBS volume ends up on multiple servers within one Availability Zone. It’s not clear exactly what Amazon is doing, since they don’t provide details (all part of the AWS secret sauce), but we do know that at least two copies of an EBS volume’s data have to exist on separate servers.
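Incidentally, the scoping of a volume to a single Availability Zone is visible in the public API, even though the internal replication is not. A minimal boto3 sketch (the region and volume ID below are placeholders of mine, assuming credentials are already configured):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption

# describe_volumes reports the Availability Zone the volume lives in;
# the behind-the-scenes replication within that AZ is not exposed anywhere.
resp = ec2.describe_volumes(VolumeIds=["vol-0123456789abcdef0"])  # placeholder ID
vol = resp["Volumes"][0]
print(vol["VolumeId"], vol["AvailabilityZone"], vol["State"])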

So with this architecture in mind, what does it take for a single EBS volume to fail?

Say one of the drives holding data for an EBS volume goes bad (either the drive itself has a problem or the server hosting it does). The volume’s data still exists on other servers/drives in the AZ, but the volume is no longer in an optimal redundant state … so Amazon goes to work restoring that redundancy by copying the volume’s data to a new drive/server. As probably every sysadmin knows, the time when a RAID array is being rebuilt is the time when it is most vulnerable to failure. Not only does the array have to keep up with normal drive activity, it also has to deal with the additional stress of reading all of its existing data so it can be copied to the new drive.
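To see why the length of the rebuild window matters, here is a rough sketch of the probability that the last surviving copy fails while redundancy is being restored. The 4% AFR and the rebuild durations are illustrative assumptions, not figures published by AWS:

import math

def p_fail_during(rebuild_hours, afr=0.04):
    # Model drive failure as a Poisson process with the given annual
    # failure rate; the chance of failing within the rebuild window is
    # 1 - exp(-rate * time).
    rate_per_hour = -math.log(1 - afr) / (365 * 24)
    return 1 - math.exp(-rate_per_hour * rebuild_hours)

for hours in (1, 6, 24):
    print(f"rebuild takes {hours:>2} h -> "
          f"{p_fail_during(hours):.4%} chance the surviving copy fails first")

The risk grows roughly linearly with the rebuild time, which is exactly why shrinking that window matters.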

This is where EBS snapshots come into the picture. Without snapshots, Amazon would need to copy all of the volume’s data from one of the surviving copies to the newly added drive, which puts additional stress on that surviving copy. A failure of the drive holding that last remaining copy would mean the failure of the EBS volume itself (assuming a simplified model of EBS redundancy where a volume’s data lives on a mirror set of two drives). With snapshots of the volume available, it appears that Amazon can rebuild most of the new replica from the latest snapshot and then copy from the surviving drive only the blocks that changed since that snapshot was taken. The more frequently snapshots are taken, the less data has to be copied from the live EBS drive and the more of it comes from the snapshot (EBS snapshots are stored in S3 – a completely separate, highly redundant storage service in AWS). Less data to copy from the drive replica means a shorter copy window and less stress on that drive – and therefore a higher chance that the drive keeps working until redundancy is restored.
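A quick sketch of that intuition: compare how much data has to be read from the surviving replica in each case. The volume size, the amount of modified data and the copy throughput below are made-up assumptions, purely to show the scaling:

# Illustrative assumptions only – not AWS figures.
volume_gb = 1000          # a 1 TB EBS volume
modified_gb = 20          # data changed since the most recent snapshot
throughput_mb_s = 100     # sustained copy throughput from the replica

def copy_hours(gb):
    return gb * 1024 / throughput_mb_s / 3600

print(f"full copy from the surviving replica: {copy_hours(volume_gb):5.1f} h")
print(f"copy only the modified blocks:        {copy_hours(modified_gb):5.1f} h "
      "(the rest comes from the snapshot in S3)")

With these numbers the surviving replica is busy for minutes instead of hours, and combined with the previous sketch that translates directly into a lower chance of losing the volume.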

All in all, this does appear to explain how the frequency of EBS snapshots affects the failure rate of EBS volumes … but it’s certainly not intuitive for somebody thinking about drives, RAID and redundancy in a non-cloud environment. From a data-redundancy perspective, tying EBS volumes together in software RAID at the OS level does not appear to be the preferred architecture in AWS. The approach recommended by AWS is to take frequent EBS snapshots and let AWS handle the redundancy behind the scenes. It all goes to show that some of the best practices from the non-cloud world have to be rethought in a cloud environment.
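For completeness, taking a snapshot programmatically is a one-liner with boto3. This is just a minimal sketch (the region, volume ID and tag values are placeholders of mine); in practice you would run something like it on a schedule:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption

# EBS snapshots are incremental, so taking them frequently stays cheap
# while keeping the "modified since the last snapshot" delta small.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",                 # placeholder volume ID
    Description="Scheduled snapshot for durability",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "purpose", "Value": "frequent-backup"}],
    }],
)
print(snapshot["SnapshotId"], snapshot["State"])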