Amazon, microservices and the birth of AWS cloud computing

While doing some research on microservices I came across an interesting video from about five years ago in which Werner Vogels, Amazon’s CTO, talks about how (and why) Amazon switched to a microservices architecture. The presentation explains the challenges amazon.com was facing in its early years and how the internal solutions to those problems later became the basis for AWS cloud computing.

Werner Vogels – Amazon and the Lean Cloud

It’s a relatively short presentation – about 30 minutes – but it’s full of interesting details about those ‘early days’ of cloud computing. Here are some highlights:

  • In the early 2000s Amazon’s main e-commerce site – amazon.com – was facing some technical challenges. Its architecture at the time was typical of the web applications we still build today: a single monolithic application code base, a common technology stack across the whole site, and massive relational databases on the backend. What were some of the problems they were having? Code compiles and deployments were taking too long. The backend databases were massive and hard to manage. Bottlenecks existed everywhere – it was getting harder and harder to make progress, release new features and keep up with growth.
  • Amazon’s technical architects analyzed the problem and realized that the path they were on would not take them much further. The decision was made to move towards a microservices architecture (they didn’t call it that back then, but that’s essentially what they were building). The idea was that every feature and capability of the retail site would be provided by a mini-service interacting with other services through well-defined interfaces. This is the path amazon.com followed for the next few years. According to Werner, the amazon.com homepage is now put together by a few hundred such microservices (see the sketch after this list).
  • It’s hard to believe that such an architecture could actually work at the scale amazon.com needed – it sounds like a perfect recipe for chaos. Making it work required specific changes to how Amazon’s internal teams operated. The idea of “two-pizza teams” was at the core: a team supporting a particular microservice should be no bigger than the number of developers who could be fed by two pizzas. In practice that meant no more than about 10 technical folks on a team – small enough to get work done without complex meetings to bring everybody up to date on progress. Teams chose the technology stack they would use for their microservice. Another critical concept was “you build it, you run it”: these small teams were in charge of both development and operations for their service (they were doing devops before it was cool). Amazon ended up with hundreds of such teams working on the amazon.com site.
  • Things were going well initially, but after a while they realized that progress and productivity were slowing down. A closer analysis showed that these teams were now spending close to 70% of their time on operations work – keeping their services running at the high-availability standards amazon.com required. Engineers were solving the same problems over and over on their own because there was no common internal infrastructure they could build on.
  • This is when the idea of infrastructure on demand started to take shape – the beginning of AWS cloud operations. First came object storage (S3), then compute (EC2), and on they went from there. Somehow along the way these internal elastic ‘cloud’ capabilities were exposed to external customers, and the rest is history.
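
To make the composition idea concrete, here is a toy sketch (in Python) of a page being assembled from several small services behind well-defined interfaces. This is not Amazon’s actual design – the service names and return values are entirely made up for illustration.

```python
# Toy illustration of composing a page from many small services.
# Each "service" here is just a function, standing in for an independently
# owned and operated microservice with its own data store and stack.
from concurrent.futures import ThreadPoolExecutor


def recommendations_service(user_id):
    # hypothetical service: returns product IDs to recommend
    return ["book-123", "book-456"]


def cart_service(user_id):
    # hypothetical service: returns a summary of the user's cart
    return {"items": 2}


def reviews_service(product_id):
    # hypothetical service: returns reviews for a product
    return [{"stars": 5, "text": "Great read"}]


def render_homepage(user_id):
    # The page owner depends only on each service's interface,
    # not on how the owning team implements or operates it.
    with ThreadPoolExecutor() as pool:
        recs = pool.submit(recommendations_service, user_id)
        cart = pool.submit(cart_service, user_id)
        reviews = pool.submit(reviews_service, "book-123")
        return {
            "recommendations": recs.result(),
            "cart": cart.result(),
            "reviews": reviews.result(),
        }


if __name__ == "__main__":
    print(render_homepage("user-42"))
```

In the real system each call would of course be a network request to a service run by its own two-pizza team; the point is simply that the page is an aggregation of many independent services.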

It’s a fascinating inside look at how the AWS cloud was born. If you’ve ever wondered how Amazon, an online book retailer, ended up becoming a cloud computing powerhouse, this video will give you some of the answers.

Resources for Amazon Web Services (AWS) migrations from EC2-Classic to EC2-VPC

EC2 (Elastic Compute Cloud) is one of the best-known AWS services: it lets customers purchase resizable cloud hosting resources on demand. It is – in my opinion – the best current implementation of IaaS (Infrastructure as a Service) from any cloud provider. It truly delivers on the promise that within minutes you can spin up a new server instance and get on with meaningful work, instead of waiting days or weeks for a vendor (or IT department) to purchase and deliver properly configured hardware.
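
To show how little ceremony is involved, here is a minimal sketch using boto3 (the AWS SDK for Python). The AMI ID, key pair name and region are placeholders – substitute values from your own account before running it.

```python
# Minimal sketch: launch a single EC2 instance and wait for it to be running.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",   # placeholder AMI ID
    InstanceType="t2.micro",
    KeyName="my-key-pair",    # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]

# A few minutes later the instance is up -- no hardware procurement involved.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print(f"Instance {instance_id} is running")
```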

The original implementation of EC2 (which over time became known as EC2-Classic) had one interesting shortcoming that became more and more apparent as customers built increasingly complex solutions: EC2 instances for ALL customers in a given AWS region share private IP addresses in the 10.x.x.x space (technically some ranges from that Class A network are not used, but that’s not relevant for this discussion). For example, you could have in your account a web server with the IP address 10.150.22.220 and another web server with the IP address 10.20.100.70 – and you had essentially no idea of, or control over, who was using the private IP address 10.20.100.71.

AWS did provide security groups (essentially software firewalls that wrap around instances) so that customers could group together EC2 instances with similar functions and keep intruders out, but as solutions grew more complex it became harder and harder to manage the instances one owned in AWS.
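
As a rough illustration of how a security group wraps a firewall around a set of instances, here is a small boto3 sketch that creates a group for a web tier and opens only HTTP to the world. The group name and rule are examples, not a recommendation.

```python
# Sketch: create a security group and allow inbound HTTP only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

sg = ec2.create_security_group(
    GroupName="web-servers",                 # example name
    Description="Public web tier, HTTP only",
)

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 80,
            "ToPort": 80,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],  # open to the world
        }
    ],
)
```

Instances launched with this group accept inbound HTTP from anywhere and nothing else.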

This model, with one’s servers spread all over the 10.x.x.x range, was not the way networking professionals were used to running networks in their own data centers.

In 2009 AWS introduced an improvement to the original EC2 approach – VPC (Virtual Private Cloud). In a VPC, customers now had the ability to use their own private IP range, divide the network as they saw fit, and essentially return to the networking models they were used to. There are many more features of AWS VPC that make it a clear winner over EC2-Classic, but those are not the main focus of this article.
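
A minimal boto3 sketch of that idea – carve out your own address space and subdivide it however you like – might look like this. The CIDR blocks and availability zones are just examples.

```python
# Sketch: create a VPC with a private address range of your choosing,
# then divide it into subnets (e.g. one per availability zone).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")       # your own 10.x range
vpc_id = vpc["Vpc"]["VpcId"]

subnet_a = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)
subnet_b = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1b"
)
print(vpc_id, subnet_a["Subnet"]["SubnetId"], subnet_b["Subnet"]["SubnetId"])
```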

For a while we had the two technologies side by side – EC2-Classic and EC2-VPC – and customers could create EC2 instances using either model. It was becoming clear, though, that EC2-VPC was the superior technology, and AWS confirmed it on 2013-12-04: accounts created after that date support only EC2-VPC. In those new accounts AWS automatically creates a default VPC and places all EC2 instances inside it.
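
If you are curious which VPC that is in your own account, a quick boto3 query (sketch only, requires valid credentials) shows the default VPC that new EC2 instances land in when you don’t specify one:

```python
# Sketch: list the default VPC created automatically in post-2013 accounts.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

result = ec2.describe_vpcs(Filters=[{"Name": "isDefault", "Values": ["true"]}])
for vpc in result["Vpcs"]:
    print(vpc["VpcId"], vpc["CidrBlock"])
```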

Many older AWS customers were gradually faced with a dilemma: what were they supposed to do with their aging EC2-Classic instances? AWS is constantly innovating, adding new instance types and new features, but many of those are now only available on the EC2-VPC side. If customers want to take advantage of the latest AWS features, they need to consider migration paths from EC2-Classic to EC2-VPC.

So what are the migration options available?


List of technology podcasts

I’m a firm believer in the concept of “Automobile University” coined by the well-known motivational speaker Zig Ziglar – the idea that time spent in traffic can and should be used to educate oneself on a variety of subjects. As such, I subscribe to quite a few technology / IT podcasts and make sure my audio player always has plenty of interesting episodes in the queue.

I have already shared the list below with plenty of friends and co-workers who know I listen to a variety of podcasts, so I figured I should just turn it into a blog post for future reference.

At the time of this post all podcasts mentioned below appear to still be active – kudos and many thanks to all these authors who keep creating solid technical content for all of us to enjoy.

In the list below I link to each podcast’s site. To subscribe, you should be able to find them in the iTunes store or wherever else you grab your podcasts. If you know of other good ones in the categories listed below, please add them in the comments.

Cloud Computing

Big Data

Databases (SQL Server)

Development (.NET, Javascript)

Infrastructure, Networking, Enterprise Tech

Security

I should also mention the extensive list of shows / podcasts from the TWiT network. Leo Laporte and his crew create some amazing content – way faster than I can possibly consume it. You’ll probably find some interesting topics there as well.

Relationship between AWS EBS snapshots and EBS volume failure rates

I’ve been doing some research lately on best practices for using AWS EBS volumes with data redundancy in mind – in particular the relationship between combining EBS volumes into a software RAID volume at the OS level (either RAID 1 or RAID 10) and the proper use of EBS snapshots.
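
For reference, taking a snapshot is a one-line API call. Here is a boto3 sketch – the volume ID is a placeholder, and in a software-RAID setup you would snapshot every member volume (ideally with I/O quiesced so the set stays consistent; that caveat is mine, not from the AWS page quoted below).

```python
# Sketch: create an EBS snapshot of a single volume.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-xxxxxxxx",                        # placeholder volume ID
    Description="Nightly snapshot of RAID member",
)
print(snapshot["SnapshotId"], snapshot["State"])
```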

On the AWS EBS details page (http://aws.amazon.com/ebs/details/) in the section on “Amazon EBS Availability and Durability” we find the following:

Amazon EBS volumes are designed to be highly available and reliable. At no additional charge to you, Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. For more details, see the Amazon EC2 and EBS Service Level Agreement.

The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS Snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.
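
As a quick back-of-the-envelope check of the “10 times” figure in that quote (my own arithmetic, not AWS’s):

```python
# Compare the quoted annual failure rates (AFR).
ebs_afr_low, ebs_afr_high = 0.001, 0.005   # quoted 0.1% - 0.5% range
commodity_afr = 0.04                        # quoted ~4% for commodity disks

print(commodity_afr / ebs_afr_high)  # 8x  at the pessimistic end of the range
print(commodity_afr / ebs_afr_low)   # 40x at the optimistic end
# AWS's "10 times" presumably corresponds to an AFR of roughly 0.4%.
```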

What puzzled me for a while was the claim that EBS volumes are more durable (with a lower failure rate) when EBS snapshots are created more often. It just doesn’t seem that the two would be related … at least not when we think about drives and backups in a non-cloud, non-redundant way. For a while I thought this was just a strange marketing statement from AWS – almost a reverse Murphy’s law: the more backups/snapshots you make, the lower the chance that your drive will fail. I understand that more frequent EBS snapshots would let you restore a more recent version of the volume’s data, but how exactly could taking snapshots affect the rate at which a volume physically fails?
