I wanted to make some comments on Service Level Agreements (SLAs), so we interrupt our scheduled Part 2 on 16GB Cloud Databases. A Service Level Agreement establishes an expectation between a service provider and a customer of the level of service to be provided, and often a contractual commitment as well. There are three ways to establish an SLA. First, you can just pull it out of your a**. Basically the customer says “I want an availability SLA of 99.9999999%” and you say “Yes, Sir!”, even though that is impossible to deliver. Maybe when it comes to contractual commitments you include so many exclusions that it becomes possible (e.g., “outages don’t count against availability calculations for SLA purposes” would be a good start). Second, you can figure out what is theoretically possible based on your design. But I’d prefer my SLAs be based on actual data, not just what math says should be possible. So the third way is math plus data. Even that turns out to be nuanced. You can influence the result both by the exclusions (e.g., excluding customer-caused outages is a pretty obvious, and valid, one) and by what penalties you are willing to accept when you miss the SLA.
When you miss an SLA you are penalized in two ways. The first is contractual: there may be financial penalties, such as a 10% reduction in the customer’s bill, for missing the SLA. An SLA will eventually be breached. When you establish the SLA based on data and math, you know what the financial penalties of those breaches will be. You can pick the SLA based on what level of financial cost you are willing to accept. In other words, SLA breaches just become a cost of doing business. What’s the difference between an SLA calling for 99.9%, 99.95%, 99.99%, or 99.999% uptime? Just an increase in your cost of goods sold.
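To put rough numbers on those targets, here is a minimal Python sketch of the arithmetic. It converts each availability percentage into the downtime it allows per year, and estimates the yearly cost of service credits. The function names, the monthly measurement period, the 5% breach probability, and the $1,000 bill are illustrative assumptions; only the 10% credit comes from the example above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted over a period by an availability target."""
    return period_minutes * (1 - availability_pct / 100)


def expected_annual_credit(monthly_bill, breach_probability, credit_fraction=0.10):
    """Expected yearly payout in service credits.

    Assumes the SLA is measured monthly, each month has the given
    probability of a breach, and a breach refunds a fixed fraction of
    that month's bill (10% by default, an illustrative figure).
    """
    return 12 * monthly_bill * breach_probability * credit_fraction


for target in (99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")

# Hypothetical customer: $1,000/month bill, 5% chance of a breach each month.
print(f"Expected credits per year: ${expected_annual_credit(1000, 0.05):.2f}")
```

The downtime budget shrinks from roughly 526 minutes a year at 99.9% to about 5 minutes a year at 99.999%. The credit estimate is the kind of number a provider can fold into its cost model once it has real breach data.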
The second penalty is reputational. When you breach an SLA it harms your reputation. If a customer runs for years before experiencing an SLA breach, that breach does little to damage the customer relationship, as long as you don’t breach the SLA again for a long time. If you breach SLAs frequently, customers learn they can’t trust your service. They may even seek alternatives.
Customers don’t even care about the financial penalties of an SLA breach. Those are trivial compared to the cost of the breach to their business. Meeting the SLA is what they really want. They see the financial penalty as an incentive for you to meet your SLA. The service provider’s accountants and lawyers will certainly want to make sure the business plan accommodates the SLA breaches, but as long as it does they will accept them.
A service provider willing to absorb a higher financial penalty from SLA breaches, and with a low concern for reputational risk, can set an SLA that it can’t consistently meet. A service provider with great concern for reputational risk will set an SLA it can consistently meet, even if that SLA is lower than its competitors’. The former favors the marketing advantage of a high SLA; the latter favors actual customer experience.
Which would you rather have, a service that claims 99.999% availability but only delivers it 99.9% of the time, or one that claims 99.99% availability and delivers it 99.99% of the time? The 5 9s SLA sounds great, but it has 10x the breaches of the 4 9s SLA! Do you want an SLA that your service provider almost always meets or one that sounds, and is, too good to be true?
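The 10x figure falls out of simple arithmetic, sketched here in Python. The function name and the 10,000 measurement periods are illustrative assumptions, and “delivers it 99.9% of the time” is read as “meets the SLA in 99.9% of measurement periods”.

```python
def expected_breaches(periods, met_pct):
    """Expected number of periods in which the SLA is missed."""
    return periods * (1 - met_pct / 100)


# Claims 99.999%, but only meets that claim 99.9% of the time.
five_nines = expected_breaches(10_000, 99.9)
# Claims 99.99%, and meets that claim 99.99% of the time.
four_nines = expected_breaches(10_000, 99.99)

print(f"5 9s SLA: about {five_nines:.0f} breaches per 10,000 periods")
print(f"4 9s SLA: about {four_nines:.0f} breach per 10,000 periods")
print(f"Ratio: about {five_nines / four_nines:.0f}x")
```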
Personally I’ll take the consistent SLA, for two reasons. First, because I can and will design around an SLA I can trust. But one that is fictional will cause me to make bad decisions. Second, because the service provider giving me an SLA that will reflect my actual experience is a service provider I can trust.
Bottom line, take SLAs with a large grain of salt, particularly when you can’t tell how often the SLA is breached. More so if a service provider offers an SLA before having gained a significant amount of operational experience. And if you can get a service provider to tell you how often they breach their SLA, more power to you.