Back in Part 2 I discussed the relationship between failures and the people resources needed to address them, and demonstrated why at hyperscale you can’t use people to handle failures. In this part I’ll discuss how that impacts a managed service. If you’ve wondered why it takes time, sometimes a seemingly unreasonable amount of time, for a new version to be supported, why certain permissions are withheld, why features may be disabled, etc., then you are in the right place.
tl;dr At hyperscale you need extreme automation. That takes more time and effort than those who haven’t done it can imagine. And you have to make sure the user can’t break your automation.
We probably all have used automation (e.g., scripts) at some point in our careers to accomplish repetitive operations. In simple cases we do little or no error handling and just “deal with it” when the script fails. For more complex scripts, perhaps triggered automatically on events or a schedule, we put in some simple error handling. That might just focus on resolving the most common error conditions, and raising the proper notifications for uncommon or otherwise unhandled errors. Moreover, the scripts are often written to manage resources that we (or a small cadre of our co-workers) own. So a DBA might create a backup script that is used to do backups of all the databases owned by their team. If the script fails then they, or another member of their team, are responsible for resolving the situation. If the team makes a change to a database such that the scripts fail, the responsibility for resolving the issue remains with them. This can be as human intensive or as automated as your environment supports, because it all rests with the same team.
In the case of a managed service the operational administration (“undifferentiated heavy lifting” such as backups, patching, failover configuration and operation, etc.) of the database instance is separated from the application-oriented administration (application security, schema design, stored procedure authoring, etc.). The managed service provider creates automation around the operational administration, automation that must work against a vast number (i.e., “millions” was where we ended up in Part 2) of databases owned by a similarly large number of different organizations.
In Part 2 I demonstrated that the Escaped Failure Rate (EFR), that is, the rate of failures that require human intervention, had to be 1 in 100 billion or better in order to avoid the need for a large human shield (and the resulting costs) to address those failures. Achieving 1 in 100 billion requires an extreme level of automation. For example, there are failure conditions that occur so infrequently that a DBA or System Engineer might not see them in their entire career. At hyperscale, that error condition might present itself several times per day and many times on a particularly bad day. As an analogy, you are unlikely to be hit by lightning in your lifetime. But it does happen on a regular basis, and sometimes a single strike can result in multiple casualties (77 in one example). At hyperscale on any given day there will be a “lightning strike”, and occasionally there will be one resulting in mass “casualties”. So you need to automate responses for conditions that are exceedingly rare as well as those that are common.
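To put rough numbers on the "once in a career" versus "several times per day" contrast, here is a back-of-the-envelope sketch in Python. Only the fleet size of one million instances comes from Part 2. The 20 databases per DBA and the 30-year career are purely illustrative assumptions, as is the function name, so treat the result as an order-of-magnitude illustration rather than a measurement.

```python
def expected_daily_occurrences(fleet_size, dbs_per_dba, career_years):
    """Expected fleet-wide daily count of a condition that one DBA
    would expect to see about once in an entire career.

    All inputs other than the fleet size are hypothetical.
    """
    # One expected occurrence spread over every database-day in a career.
    database_days_per_career = dbs_per_dba * career_years * 365
    per_instance_daily_rate = 1 / database_days_per_career
    return fleet_size * per_instance_daily_rate


# Fleet size from Part 2; the other two values are assumptions.
per_day = expected_daily_occurrences(
    fleet_size=1_000_000, dbs_per_dba=20, career_years=30
)
print(f"Expected occurrences per day across the fleet: {per_day:.1f}")
```

Under these assumptions, something a single DBA sees about once in 30 years shows up a handful of times every day across the fleet, which is why the rare cases need automated handling too.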
As the level of automation increases you have to pay attention to overall system complexity. For example, if you are a programmer then you know that handling concurrency dramatically increases application complexity. And DBAs know that a whole bunch of the complex work in database systems (e.g., the I in ACID) is focused on supporting concurrent transactions. When thinking about automation, you make it dramatically more complex by allowing concurrent automation processes. In other words, if you allow concurrent automation processes against the same object (e.g., a database instance) then you have to program them to handle any cases where they might interfere with one another. For any two pre-defined processes, assuming they have no more than modest complexity, that might be doable. But as soon as you allow a more general case the ability to ensure the concurrent processes can successfully complete, and complete without human intervention, becomes impractical. So when dealing with any one thing, for example a single database instance, you serialize the automation.
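A minimal sketch of that serialization idea, in Python, might look like the following. It is not how RDS implements it. The class name, the instance identifiers, and the fake task are all hypothetical. The only point is that tasks for the same instance run one at a time, while tasks for different instances can still proceed in parallel.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class SerializedAutomation:
    """Runs at most one automation task at a time per instance."""

    def __init__(self):
        self._registry_lock = threading.Lock()
        self._locks = {}

    def _lock_for(self, instance_id):
        # Guard the registry so two threads cannot create two
        # different locks for the same instance.
        with self._registry_lock:
            return self._locks.setdefault(instance_id, threading.Lock())

    def run(self, instance_id, task):
        # Tasks for the same instance queue up behind this lock.
        with self._lock_for(instance_id):
            return task(instance_id)


# Demo: record the highest number of tasks seen running at once per instance.
active = defaultdict(int)
max_active = defaultdict(int)
counter_lock = threading.Lock()


def fake_task(instance_id):
    with counter_lock:
        active[instance_id] += 1
        max_active[instance_id] = max(max_active[instance_id], active[instance_id])
    time.sleep(0.01)  # stand-in for a backup, patch, or scale storage step
    with counter_lock:
        active[instance_id] -= 1
    return instance_id


automation = SerializedAutomation()
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [
        pool.submit(automation.run, f"db-{i % 2}", fake_task) for i in range(8)
    ]
    results = [f.result() for f in futures]

print(dict(max_active))
```

The cost of this simplicity is visible in the sketch: a slow task holds the instance's lock and every other task for that instance waits behind it.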
I kicked this series off discussing database size limits. The general answer for why size limits exist is the interaction between the time it takes to perform a scale storage operation and how long you are willing to defer execution of other tasks. Over time it became possible to perform scale storage on larger volumes within an acceptable time window, so maximum size was increased. With the advent of EBS Elastic Volumes the RDS automation for scale storage can (in most cases) complete very quickly. As a result it doesn’t block other automation tasks, enabling 16TB data volumes for RDS instances.
The broader implications of the requirements for extreme automation are:
- If you can’t automate it, you can’t ship it
- If a user can interfere with your automation, then you can’t deliver on your service’s promises, and/or you can’t achieve the desired Escaped Failure Rate, and/or they will cause your automation to actually break their application
- A developer is able to build a feature in a couple of days that might take weeks or months of effort to sufficiently automate before being exposed in a hyperscale environment
One of the key differences that customers notice about managed database services is that the privileges you have on the database instance are restricted. Instead of providing the administrative user with the full privileges of the super user role (sysadmin, sysdba, etc.) of the database engine, Amazon RDS provides a Master user with a subset of the privileges those roles usually confer. Privileges that would allow the DBA to take actions that break RDS’ automation are generally excluded. Likewise, customers are prohibited from SSHing into the RDS database instance because that would allow the customer to take actions that break RDS’ automation. Other vendors’ managed database services have identical (or near identical) restrictions.
Let’s take a deeper look at the implications of restricted privileges and lack of SSH, and how they interact with our efforts to limit EFR. When a new version of software is released it always comes with incompatibilities with earlier versions (and bugs of its own of course). A classic example is where a new version fixes a bug in an older version. Say a newer version of database engine X either fixes a bug where X-1 was ignoring a structural database corruption, or introduces a bug where X can’t handle some condition that was perfectly valid in X-1. In either case, the in-place upgrade process for taking a database from X-1 to X fails when the condition exists, leaving the database inaccessible until the condition is fixed. To fix this you have to SSH into the instance and/or access resources that are not accessible to you. Now, let’s say this happens in 1 out of 1000 databases. If the service provider doesn’t automate the handling of this condition then, since the customer can’t resolve it themselves, the service provider will need to step in 1000 times for the 1 million instance example. Did you read Part 2? That’s not a reasonable answer in a hyperscale environment. So the managed service can’t offer in-place version upgrades until it has both uncovered these issues and created automation for handling them.
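The arithmetic behind that "1000 times" figure can be written down as a small Python sketch. The instance count and the 1-in-1000 failure rate come from the text above. The function name and the `automated_fraction` parameter are hypothetical, added only to show how the number of manual interventions shrinks as more of the failure cases are handled by automation.

```python
from fractions import Fraction


def manual_interventions(instances, failure_rate, automated_fraction=Fraction(0)):
    """Expected upgrades that still need a human to step in.

    automated_fraction is the (hypothetical) share of failing upgrades
    that automation resolves without human help.
    """
    return instances * failure_rate * (1 - automated_fraction)


# Figures from the text: 1 million instances, 1 in 1000 upgrades hit the condition.
no_automation = manual_interventions(1_000_000, Fraction(1, 1000))

# Hypothetical: automation resolves 999 of every 1000 failing upgrades.
mostly_automated = manual_interventions(
    1_000_000, Fraction(1, 1000), automated_fraction=Fraction(999, 1000)
)

print(f"Manual interventions with no automation: {no_automation}")
print(f"Manual interventions with automation: {mostly_automated}")
```

Even the hypothetical automated case leaves work for a human, and this is only one failure condition in one upgrade path, which is why the handling has to be in place before the upgrade is offered.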
Similar issues impact the availability of new versions of database software (even without upgrade in place). Changes (features or otherwise) that impact automation, be that creation of new automation or changes to existing automation, have to be analyzed and work completed to handle those changes. Compatibility problems that will break currently supported configurations have to be dealt with. Performance tuning of configurations has to be re-examined. Dependencies have to be re-examined. Etc. And while some of this can be done prior to a database engine’s General Availability, often changes occur late in the engine’s release cycle. A recent post in the Amazon RDS Forum was complaining about RDS’ lack of support for MySQL 8.0, which went GA last April. So I checked both Google Cloud SQL and Microsoft Azure Database for MySQL and neither of them supported MySQL 8.0 yet either. To be supportable at hyperscale, new releases require a lot of work.
Let me digress here a moment. The runtime vs. management dichotomy goes back decades. With traditional packaged software the management tools are usually way behind in supporting new runtime features. With Microsoft SQL Server, for example, we would constantly struggle with questions like “We don’t have time to create DDL for doing this, so should we just expose it via DBCC or an Extended Stored Procedure?” or “This change is coming in too late in the cycle for SSMS support, is it ok to ship without tool support?” or “We don’t have time to make it easy for the DBA, so should we just write a whitepaper on how to roll your own?” The SQL Server team implemented engineering process changes to improve the situation, basically slowing feature momentum to ensure adequate tools support was in place. But I still see cases where that doesn’t happen. With open source software (including database engines), the tooling often comes from parties other than the engine developers (or core community), so the dichotomy remains.
It’s not just that management support can’t fully be done until after the feature is working in the database engine (or runtime or OS or…), it is that for many features the effort to provide proper management exceeds the cost of developing the feature in the first place. On DEC (now Oracle) Rdb I was personally involved in cases where I implemented a runtime feature in a couple of hours that turned into many person-days of work in tools. Before I joined AWS I noticed that RDS for SQL Server didn’t support a feature that I would expect to be trivial to support. After I joined I pressed for its implementation, and while not a huge effort it was still an order of magnitude greater than I would have believed before actually understanding the hyperscale automation requirements. So while I’m writing this blog in the context of things running at hyperscale, all that has really changed in decades is that at hyperscale you can’t let the management aspects of software slide.
There is a lot more I could talk about in this area, but I’m going to stop now since I think I made the point. At hyperscale you need ridiculously low Escaped Failure Rates. You get those via extensive automation. To keep your automation operating properly you have to lock down the environment so that a user can’t interfere with the automation. That locked down environment forces you to handle even more situations via additional automation.
When all this works as intended you get benefits like I described years ago in a blog I wrote about Amazon RDS Multi-AZ. You also get to have that managed high availability configuration for as little as $134 a year, which is less than the cost of an hour of DBA time. And the cloud providers do this for millions of instances, which is just mind-boggling. Particularly if you recall IBM Founder Thomas Watson Sr.’s most famous quote, “I think there is a world market for maybe five computers.”