In part one of this series I used recent increases in maximum database size as a driver for introducing the challenges of hyperscale computing. In this part we dive into the heart of the matter: what it takes to operate at hyperscale. Where hyperscale computing begins is an academic question, and the lessons here apply to modest numbers of computer systems as well as huge ones. The difference is that with modest numbers you have choices; with huge numbers (as you shall see) you really don’t. For our purposes we will assume hyperscale means at least hundreds of thousands of “systems”, and will use 1 million virtual machines (instances, or virts) as a good order of magnitude for illustration. To put this in context, AWS has millions of customers, and each has at least one, and probably many, instances. Even when a customer is using something that is “serverless”, there are instances behind the scenes. So rather than being far-fetched, 1 million is a good order of magnitude to focus on.
Say you are a DBA dedicated to the care and feeding of an important database. Nightly backups of that database fail (meaning they need human intervention) 1 time in 1000, so you get paged about a failed backup roughly once every three years. You sleep well. Now say you are responsible for 100 databases. With a 1 in 1000 failure rate you are being paged every 10 days. Still not too bad. How about 1000 databases? Now you are being paged for a failure every day, 365 days per year. This is no longer any fun. How well do you sleep knowing that at some point during the night your pager will go off and you will have to work for minutes to hours? At this point one “primary” responder (be that a DBA, Systems Engineer, SDE, etc.) isn’t even viable; you need at least two so someone is always available to deal with failures. Really you need at least three, and by some calculations four to five (when you factor in vacations, health issues, turnover, etc.).
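The paging math above can be sketched in a few lines; the 1-in-1000 failure rate and one-backup-per-night cadence are the hypothetical numbers from this discussion, not measured data.

```python
# Expected paging cadence as a fleet grows, assuming independent failures.
def days_between_pages(num_databases, failure_rate=1/1000, backups_per_day=1):
    """Expected days between pages for a given fleet size."""
    failures_per_day = num_databases * backups_per_day * failure_rate
    return 1 / failures_per_day

for n in (1, 100, 1000):
    print(f"{n:>5} databases -> a page every {days_between_pages(n):g} days")
```

At one database that is a page every 1000 days (call it three years); at 1000 databases it is a page every single day.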
How about 1 million database instances? At our 1 in 1000 failure rate you need to handle 1000 failures per day! This turns into an army of people doing nothing but responding to backup failures. How big an army? Let’s say a backup failure can be resolved in 15 minutes, so one person can handle 4 failures an hour. They handle failures for 7 hours a shift (assuming 1 hour for lunch, breaks, etc.), or 28 failures each. That translates to 36 people dedicated to handling backup failures each and every day. To sustain that you would need an overall team of between 108 and 180 people.
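Here is that staffing arithmetic written out; every input (15-minute fixes, 7 productive hours per shift, 3x-5x coverage for nights, weekends, vacations, and turnover) is an assumption from the text.

```python
import math

# 1 million instances at a 1-in-1000 nightly failure rate.
failures_per_day = 1_000_000 // 1000              # 1000 backup failures/day
fixes_per_shift = (60 // 15) * 7                  # 4 per hour x 7 hours = 28
responders_per_day = math.ceil(failures_per_day / fixes_per_shift)  # 36
team_size = (responders_per_day * 3, responders_per_day * 5)        # coverage
print(responders_per_day, team_size)              # 36 (108, 180)
```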
Is a team of 180 people to handle backup failures practical? Is it cost-effective? Does anyone really want to burden their cost structure with all these people? Your organization wouldn’t let you hire them. Your public cloud provider is going to have to include their costs in its pricing, so you will be paying for them. Can you really hire and maintain large numbers of people willing and able to do this work? It’s a real challenge.
A quick example of the cost issue: an Amazon RDS MySQL t2.micro instance that you pay for on a 3-year All-Upfront Reserved Instance basis (i.e., the lowest public price) costs 18.5 CENTS PER DAY. So AWS grosses $185 a day for 1000 instances. Doing a back-of-the-envelope calculation, let me postulate that the fully burdened cost of resolving the 1 failed backup a day for those 1000 instances is $90. That leaves $95 a day to cover all hardware and infrastructure costs, other failure conditions, cost of sales, software development, etc. In other words, it’s a huge money-losing proposition. And that doesn’t even take into account the cost hit from the many t2.micros being used as part of the AWS Free Tier.
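The same back-of-the-envelope economics in code; the $90 burdened cost per failure is the postulate above, and the price is a point-in-time public number that may have changed.

```python
# RDS t2.micro economics at a 1-in-1000 nightly backup failure rate.
price_cents_per_instance_day = 18.5               # 3yr all-upfront RI rate
instances = 1000
gross_per_day = price_cents_per_instance_day * instances / 100   # $185.00
failure_cost_per_day = (instances / 1000) * 90.0  # ~1 failure/day at $90 each
margin_per_day = gross_per_day - failure_cost_per_day            # $95.00
print(gross_per_day, margin_per_day)              # 185.0 95.0
```

That $95 still has to absorb hardware, networking, storage, software development, and everything else, which is why the numbers do not work.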
So what makes more sense as a tolerable failure rate for backups at hyperscale? To get back to the point where someone is paged only once per day you’d need a failure rate of 1 in a million. Would it be reasonable, at a million (or a few million) instances, to have a team of 3-5 people handling failures? Perhaps. But the story doesn’t end there.
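Inverting the earlier arithmetic gives that target directly; this is a sketch using the same one-backup-per-day model as above.

```python
# Failure rate required to stay within a paging budget for a given fleet.
def tolerable_failure_rate(instances, ops_per_instance_per_day=1, pages_per_day=1):
    return pages_per_day / (instances * ops_per_instance_per_day)

print(tolerable_failure_rate(1_000_000))  # 1e-06, i.e. 1 in a million
```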
Let’s talk about log backup failures. Databases offer Point-In-Time Recovery (PITR), and if you want your recovery point to be within 5 minutes, you need to back up the log files at least that often. That’s 12 times per hour. So at 1 million instances you are doing 12 million log backups per hour. Yup, nearly 300 million operations per day! So even at a 1 in a million failure rate, you still would be seeing almost 300 failures a day that needed a human being to step in. And we haven’t even begun discussing anything other than backup! This suggests that our target failure rate should not be 1 in a million, but rather 1 in a billion.
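The log-backup volume works out like this (5-minute PITR means 12 log backups per hour per instance):

```python
# Log backups per day across a hyperscale fleet, and the human load
# at a 1-in-a-million failure rate.
instances = 1_000_000
backups_per_hour = 60 // 5                        # 12
ops_per_day = instances * backups_per_hour * 24   # 288,000,000
failures_per_day = ops_per_day / 1_000_000        # 1-in-a-million rate
print(f"{ops_per_day:,} ops/day -> {failures_per_day:.0f} interventions/day")
```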
Of course, if we are already talking about a million instances, and we all know how fast the cloud is growing, then we are looking at where the puck is now while we should be focused on where the puck is going. We probably should be thinking about tens of millions of instances, and targeting failure rates of 1 in 10 billion, 1 in 100 billion, or even 1 in a trillion operations.
Earlier I assumed that a backup failure could be resolved in 15 minutes. There are a lot of assumptions built into that number. While I’m sure every DBA has had the experience of looking at an issue, immediately recognizing the problem, and running a script to resolve it, they have also had the experience of spending hours or days resolving and cleaning up after a failure. We’ve known since the 1980s that computer failures are largely the result of human error, and have been working ever since to address that. So not only do you have to target failure rates of 1 in billions, you also have to reduce the cost, and the potential for negative impact, when human beings do have to get involved. And you need to do this in the context of very high security, availability, and durability goals.
I am using databases as an example to drive the discussion, but all of this applies to any area of hyperscale computing. At re:Invent 2017 AWS’ CISO, Stephen Schmidt, strongly made the point that AWS does not have a Security Operations Center. He talked some about how this is achieved, and Distinguished Engineer Eric Brandwine offered a deeper look. I wonder how low a failure rate they had to achieve to make it possible to eliminate the SOC?
In the next segment of this series I’ll dive into how the need both to achieve very low failure rates and to make resolution of those failures fast and foolproof comes through in public cloud database offerings. That will cover some generic design patterns, but also deal specifically with the behaviors and feature sets of managed database services.