I’ve been doing some research lately on best practices that pertain to the correct use of AWS EBS volumes from the point of view of data redundancy – in particular around the relationship between combining EBS volumes in a software RAID volume at the OS level (either RAID 1 or RAID 10) and the proper use of EBS snapshots.
On the AWS EBS details page (http://aws.amazon.com/ebs/details/) in the section on “Amazon EBS Availability and Durability” we find the following:
Amazon EBS volumes are designed to be highly available and reliable. At no additional charge to you, Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. For more details, see the Amazon EC2 and EBS Service Level Agreement.
The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS Snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.
The statement that really puzzled me for a while was the claim that EBS volumes are more durable (with a lower failure rate) when EBS snapshots for them are created more often. It just doesn’t appear that the two would be related … at least not when we think about drives and backups in a non-cloud, non-redundant way. I thought for a while that this was just some strange marketing statement made by AWS – almost like a reverse Murphy’s law: the more backups/snapshots you make the lower the chance that your drive will fail. I understand that having more frequent EBS snapshots of a volume would enable one to restore more recent versions of that volume’s data from the snapshot but how exactly could taking snapshots affect the rate at which a volume would physically fail?