My Mobile Phone is Sacrosanct

Sorry for my absence the last few weeks; I’ve been rather busy with a move.  I’ll try to get back to regular blogging, and I have a huge backlog of topics.  Here is a short one: the importance of my mobile phone has reached the level where I am reluctant to take risks with it.  And that is causing problems.

I recently decided not to enable my cell phone to connect to a client’s email system. Like most organizations, my client’s IT organization requires any device accessing its email system to submit to its Mobile Device Management (MDM) regime. For the most part that is not a problem, as I already manage my phone that way: for example, requiring a PIN to unlock it, and having the device set to erase itself after a number of failed PIN entries. But the usual MDM regime has one “feature” I can no longer tolerate: the ability for the organization to erase the contents of your mobile device at its discretion. And, in particular, at termination of “employment”. If I were a full-time employee, expecting to retain that status for an indefinite (i.e., multi-year) period, that might not be such a big thing. But as a consultant my access to the client’s email system might not last beyond a few months, or could even last just a few weeks. Then my phone would be wiped.

Up until recently I didn’t really care about wiping my phone, because everything really lives in the cloud.  Or so I used to think.  I would regularly switch devices, and all my important data, emails, etc. would be available on the new device.  Thank you OneDrive, OneNote, Cloud Drive, Exchange, iCloud, etc.  But increasingly there is something critical that is local only: two-factor authentication (2FA).  My phone has become my identity.

My phone has been used as a 2FA device for a long time, with many sites texting me a code I had to enter for login (or to authorize certain actions).  If that were the extent of it, then wiping the device wouldn’t really be a problem, since the phone and SIM retain the phone’s physical identity.  But recently more and more sites are depending on authentication apps that run on the device and maintain local state: for example, Microsoft’s Authenticator, Google Authenticator, MobilePass+, etc.  Lose one of those apps and re-acquiring access to the sites that were being protected is a nightmare.
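Those apps are generally implementations of time-based one-time passwords (TOTP, RFC 6238): a secret is provisioned once, stored only on the device, and hashed with the current time to produce each code.  A minimal sketch using only Python’s standard library (the secret shown is the RFC 6238 test value, not a real one) makes it clear why the codes are unrecoverable once the device is wiped — the secret never leaves the phone:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, for_time=None, digits=6, step=30):
    """RFC 6238 TOTP: derive a one-time code from a device-local shared secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if for_time is None else for_time) // step)
    msg = struct.pack(">Q", counter)                 # 8-byte big-endian time counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test secret ("12345678901234567890" in base32); at T=59 the
# 8-digit SHA-1 test vector is 94287082.
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", for_time=59, digits=8))
```

Wipe the phone and the `key` is gone; the server still expects codes derived from it, which is exactly why recovery is so painful.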

Not long ago I accidentally deleted an authentication app and discovered it would take at least 24 hours to re-acquire access to the account it protected.  Basically the site’s recovery process was to insert a 24-hour delay between the request to turn off 2FA and its taking effect.  This was done in the name of security.  Then you had a few hours to access the site with a temporary code, before that code became invalid.  Then you had to request a new code, which came 24 hours later, and so on.  I was always busy when that code appeared, so it took days to regain access.  Yeah, this is an extreme example.  But not the only one.  Since the purpose of 2FA is to provide very strong access control, recovery from loss of a 2FA device is almost always intentionally very difficult.

I was about to make the final tap on my phone to add the client’s email system when the impact of having my phone wiped hit home.  I would immediately lose access to most of my life.  My personal email, my bank accounts, even Twitter.  Losing access to my email would be the worst, because the recovery processes for most things go through email.  It would take me days of effort to put my digital life back together.  The process would spin further out of control if I didn’t have other devices with me, or if they too were wiped.  For example, if my iPad were wiped at the same time for the same reason.  I’d be living a dystopian nightmare.  I cancelled connecting my phone to their email system.

This is all starting to have a negative impact, something that will only grow as our phones become more a part of our identity.  I’ve missed time-sensitive emails from the client because I either need to log in with OWA (which needs 2FA, of course) or use my iPad (which I did connect to their email system).  I have become reluctant to upgrade my phone, because that creates the same situation.  I’d have to pre-plan the upgrade, turning off 2FA where possible and scheduling time to go through the replacement process where it isn’t.  I’ve even turned off the auto-wipe feature, because the impact of someone wiping out my identity is now greater than the likelihood that they can break into the phone before I do my own remote wipe (or otherwise disable the phone’s access to my resources).

I know I’m going to hear from people that they use solutions like carrying two phones with them, one for work and one for personal use.  That doesn’t work for me, and only addresses the catalyst for this post rather than the core issue.  A better solution for the work/personal data problem is for efforts to compartmentalize work data on a personal device to become ubiquitous.  Your employer would never have, nor need, the right to wipe your entire device but rather have a way to wipe just their data.  But that doesn’t go far enough.

Are there mechanisms to get around the loss of a 2FA device?  Sure.  My Twitter backup codes are sitting in a safe 2000 miles from where I’m writing this.  Not too useful a mechanism.  Well, why not store them online somewhere?  OK, in the case of just losing 2FA access to Twitter that would work.  But in the case of my phone being wiped, I would lose access to the store I had them in.  Put them in a store that doesn’t require 2FA?  Umm, remind me why we are doing 2FA to begin with?

Authy, an authentication app that has multi-device support and secure cloud backup, is probably the best current approach, to the extent that it can be used to replace the other authentication apps.  But it can’t always (e.g., I don’t think it can replace MobilePass+, which is often used for enterprise network access).  It also isn’t clear that Authy, or a similar 3rd party HOTP/TOTP app, will play a part in future authentication mechanisms.  As Microsoft, for example, moves away from the use of passwords, its solution may require the Microsoft Authenticator app rather than allow for Google Authenticator, Authy, etc. as alternatives.

As we continue the rapid move to our phones being our identities, every identity provider needs to provide a more robust way to recover from the loss of a phone.  But for now, I’m treating my phone as sacrosanct.  No, you can’t have permission to erase its contents.  And no, I’m no longer upgrading my phone frequently.


Challenges of Hyperscale Computing (Part 2)

In part one of this series I used recent increases in maximum database size as a driver for introducing the challenges of hyperscale computing.  In this part we dive into the heart of the matter, which is what it takes to operate at hyperscale.  Where hyperscale computing begins is an academic question, and the lessons here can be applied to modest numbers of computer systems as well as huge numbers.  The difference is that with modest numbers you have choices, with huge numbers (as you shall see) you really don’t.  For our purposes we will assume hyperscale means at least 100s of thousands of “systems”, and will use 1 Million Virtual Machines (instances or virts) as a good order of magnitude for illustration.  To put this in context, AWS has millions of customers and they each have at least one, and probably many,  instances.  Even when a customer is using something that is “serverless”, there are instances behind the scenes.  So rather than being far-fetched, 1 Million is a good order of magnitude to focus on.

Say you are a DBA dedicated to the care and feeding of an important database.  Nightly backups of that database fail (meaning they need human intervention) 1 in 1000 times, so you get paged about a failed backup once every three years.  You sleep well.  Or you are responsible for 100 databases.  With a 1 in 1000 failure rate you are being paged every 10 days.  Still not too bad.  How about 1000 databases?  Now you are being paged for a failure every day, 365 days per year.  This is starting to not be any fun.  How well do you sleep knowing that at some point during the night your pager will go off and you will have to work for minutes to hours?  At this point one “primary” responder (be that a DBA, Systems Engineer, SDE, etc.) isn’t even possible; you need at least two so someone is always available to deal with failures.  Really you need at least three, and by some calculations four to five (when you factor in vacations, health issues, turnover, etc.).
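The paging arithmetic is worth writing down: at a fixed per-backup failure rate, the expected interval between pages shrinks linearly with fleet size.

```python
def days_between_pages(n_databases, failure_rate=1 / 1000):
    """Expected days between failed nightly backups for a fleet of databases,
    assuming one backup per database per night."""
    return 1 / (n_databases * failure_rate)

for n in (1, 100, 1000):
    print(n, days_between_pages(n))   # 1000 days (~3 years), 10 days, 1 day
```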

How about 1 million database instances?  At our 1 in 1000 failure rate you need to handle 1000 failures per day!  This turns into an army of people doing nothing but responding to backup failures.  How big of an army?  Let’s say a backup failure can be resolved in 15 minutes, so one person can handle 4 failures an hour.  They handle failures 7 hours (assuming 1 for lunch, breaks, etc.) a shift, so 28 failures each.  That translates to 36 people dedicated to handling backup failures each and every day.  To achieve that you would need an overall team size of between 108 and 180.
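A quick sanity check on those staffing numbers (the 3x–5x multiplier for round-the-clock coverage, vacations, and turnover is the rule of thumb from the paragraph above):

```python
import math

failures_per_day = 1_000_000 // 1000        # 1M instances, 1-in-1000 nightly failure rate
per_person_per_day = (7 * 60) // 15         # 7 working hours/shift, 15 minutes per failure
daily_responders = math.ceil(failures_per_day / per_person_per_day)
print(daily_responders)                      # 36 people handling failures every single day
print(daily_responders * 3, daily_responders * 5)   # 108 to 180 total headcount
```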

Is a team of 180 people to handle backup failures practical?  Is it cost-effective?  Does anyone really want to burden their cost structure with all these people?  Your organization wouldn’t let you hire them.  Your public cloud provider is going to have to include their costs in its pricing, so you will be paying for them.  Can you really hire and maintain large numbers of people willing and able to do this work?  It’s a real challenge.

A quick example of the cost issue.  An Amazon RDS MySQL t2.micro instance you are paying for on a 3 Year All-Upfront Reserved Instance basis (i.e., the lowest public price) costs 18.5 CENTS PER DAY.  So AWS grosses $185 a day for 1000 instances.  Doing a back of the envelope calculation let me postulate the fully burdened cost of resolving the 1 failed backup a day for those 1000 instances is $90.  That leaves $95 a day to cover all hardware and infrastructure costs, other failure conditions, cost of sales, software development, etc.  In other words, it’s a huge money losing proposition.  And that doesn’t even take into account the cost hit on the many t2.micros being used as part of the AWS Free Tier.

So what makes more sense as a tolerable failure rate for backups at hyperscale?  To get back to the point where someone is paged once per day you’d need a failure rate of 1 in a million.  Would it be reasonable at the million (or low millions) of instances to have a team of 3-5 people who handled failures?  Perhaps.  But the story doesn’t end there.

Let’s talk about log backup failures.  Databases offer Point-In-Time Recovery (PITR), and if you want that recovery point to be within 5 minutes, you need to back up the log files at least that often.  That’s 12 times per hour.  So at 1 million instances you are doing 12 million log backups per hour.  Yup, nearly 300 million operations per day!  So even at a 1 in a million failure rate, you would still be seeing almost 300 failures a day that needed a human being to step in.  And we haven’t even begun discussing anything other than backup!  This suggests that our target failure rate should not be 1 in a million, but rather 1 in a billion.
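Here is that arithmetic, sketched for both a 5-minute and a 3-minute log backup interval; either way you end up with hundreds of daily human interventions even at a 1-in-a-million failure rate:

```python
instances = 1_000_000

for interval_min in (5, 3):
    per_hour = 60 // interval_min                 # log backups per instance per hour
    daily_ops = instances * per_hour * 24         # fleet-wide log backups per day
    print(f"{interval_min}-min interval: {daily_ops:,} backups/day, "
          f"{daily_ops / 1_000_000:.0f} failures/day at 1-in-a-million")
```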

Of course, if we are already talking about a million instances, and we all know how fast the cloud is growing, then we are looking at where the puck is now while we should be focused on where the puck is going.  We probably should be thinking about tens of millions of instances, and targeting failure rates of 1 in 10 billion, 1 in 100 billion, or even 1 in a trillion operations.

Earlier I made an assumption that a backup failure could be resolved in 15 minutes.  There are a lot of assumptions built into that number.  While I’m sure every DBA has had the experience of looking at an issue, immediately recognizing the problem, and running a script to resolve it, they have also had the experience of spending hours or days resolving and cleaning up after a failure.  We’ve known since the 80s that computer failures are largely the result of human error, and have been working ever since to address that.  So not only do you have to target failure rates of 1 in billions, you have to target reducing the cost, and the potential for negative impact, when human beings do have to get involved.  And you need to do this in the context of very high security, availability, and durability goals.

I am using databases as an example to drive the discussion, but all of this applies to any area of hyperscale computing.  At re:Invent 2017 AWS’ CISO, Stephen Schmidt, strongly made the point that AWS does not have a Security Operations Center.  He talked some about how this is achieved, and Distinguished Engineer Eric Brandwine offered a deeper look.  I wonder how low a failure rate they had to achieve to make it possible to eliminate the SOC?

In the next segment of this series I’ll dive into how the need to both achieve very low failure rates, and make resolution of those failures fast and foolproof, comes through in public cloud database offerings.  That will cover some generic design patterns, but also deal specifically with the behaviors and feature sets of managed database services.




Microsoft “can’t win for losing”

When it comes to the consumer, Microsoft’s history can best be described as “I got it. I got it. I got it. <THUMP> I ain’t got it.”.  Today is the 4th anniversary of my Xbox: Fail blog post, and this week Microsoft put the final nail in the coffin of Kinect.  So it really is an appropriate point to talk about Microsoft and the consumer.  Microsoft is not a consumer-focused company, and never will be despite many attempts over the decades.  Recognition of this reality, and an end to tilting at windmills, is one of the things that Satya Nadella seems to have brought to the table.

First let’s get something out of the way: we need to refine what we mean by the label “consumer”.  It isn’t simply the opposite of business/organizational users.  Microsoft has always done just fine in providing individuals with personal productivity and content creation tools.  The Windows-based PC remains at the center of any complex activity.  Sure, I book some flights on my iPhone or iPad.  But when I start putting together a complex multi-leg trip the PC becomes my main tool.  Office has done well with consumers, and continues to do so in spite of popular free tools from Google.  And over the last few years Microsoft has gained traction with the artistic/design crowd that had always gravitated towards the Mac.  So when we talk about the consumer we really are talking about experiences that are left of center on the content consumption to content creation spectrum.  Microsoft will always be a strong player in right-of-center content creation experiences, be it for individuals, families, or organizations.  But, other than console gaming, they aren’t going to be a significant player in left-of-center experiences.  And Microsoft fans are going to go crazy over that.

The end of life for Kinect is the perfect illustration of Microsoft’s inability to be a consumer player.  The Xbox One with (then mandatory) Kinect was introduced a year before the Amazon Fire TV and a year and a half before the Amazon Echo.  It was originally tasked with becoming the center of home entertainment, and offered a voice interface.  Go read my Xbox: Fail piece for how it wasn’t ready to live up to that design center.  It’s pretty typical Microsoft V1 stuff.  Unfortunately the Xbox One was also V1 from a console gaming perspective, so Microsoft focused on making it more competitive in that niche and abandoned pushing forward on the home entertainment side.  Imagine that: Microsoft had a beachhead of 10s of millions of voice-enabled devices in place before Amazon even hinted at the Echo, and failed to capitalize on it.  You can repeat that story many times over the last 25 years.

It isn’t that Xbox One was the perfect device for the coming voice assistant, or streaming TV, revolutions.  The need to be a great gaming console gave it much too high a price point for non-gamers.  But Microsoft could have continued to evolve both the experience and produced lower priced, non-gaming focused, hardware.  Contrast what Microsoft did with what Amazon did around the Echo.  When the Echo was introduced it was considered a curiosity, a niche voice-operated speaker for playing music.  When Amazon started to gain traction with the Echo and Alexa, they went all in, and as a result have a strong lead in today’s hottest segment of the consumer technology space.  It reminded me a lot of Microsoft’s pivot to the Internet back in 1995.  But in the Xbox One case, Microsoft had the vision (at least in general direction), but failed to capitalize on it.  Failed to even make a serious attempt.  Now, at best, it could fight it out for a distant 4th or 5th place in voice assistants and home entertainment.  This consumer stuff just isn’t in Microsoft’s DNA.

The death of the Groove Music Service is another example, and maybe more telling on why Microsoft hasn’t been able to crack the code on the consumer.  Groove is just the latest name for Zune’s music service.  When MP3 players became popular, Microsoft jumped on the bandwagon in keeping with its DNA: it relied on 3rd parties that it supplied with technology (e.g., DRM).  When that didn’t even turn out to be a speedbump on the iPod’s adoption, it finally introduced the Zune as a first party device.  To have as good an experience as an iPod, the Zune needed an iTunes equivalent, and what we now know as the Groove Music Service was born.  Despite the jokes its failure spawned, the Zune was quite a nice device.  But since it couldn’t play the music you’d acquired with iTunes, there really was no iPod to Zune migration path.  By the time Zune came on the market the game was already over.  As Zune died, other consumer-focused device efforts came to the fore (Kin, Windows Phone 7, Xbox One) and the music service lived on.  But since the devices never gained traction, neither did the music service.  And for Microsoft the music service was never a player on its own; it was just a necessary evil to support its consumer device experience.  With that mindset, the failure to gain traction with consumer devices meant Groove was superfluous.  Sure, Groove could have owned the segments that Spotify and Pandora now dominate, but that was never what Microsoft was going for.  And now, it is too late.

Being a content creator or distributor is not in Microsoft’s DNA.  It has an immune system that rejects it time and time again.  Microsoft made a big play on consumer titles in the early to mid 90s, remember Microsoft Dogs and Encarta?  Offerings like these are very manpower intensive because they need a lot of content production, editing, frequent updating, sell for very little, are expensive to localize, and often don’t even make sense globally.  So Microsoft concluded they didn’t fit well with its business model and backed away from all but a few major titles such as Encarta.  While Encarta was great for its time, the Internet left it competing with Wikipedia.  That destroyed what little economic value Encarta had.  Other content-oriented efforts, such as Slate, were disposed of to save costs when the Internet Bubble burst.  The MSNBC joint venture was allowed to dissolve when its contract came up for renewal.  And so on.

I could even say that great end user experiences are not in Microsoft’s DNA, though that one is more debatable.  Usually it is thought of as being consistently second to Apple.  So rather than saying they aren’t in Microsoft’s DNA, I’d say that Microsoft user experiences are almost always compromised by more dominant aspects of its DNA.  And that keeps it from being a great consumer experience company.

What is Microsoft good at?  Creating platforms that others build on.  Doing work that is technically hard, and takes a lot of engineering effort, that it can sell over and over again.  High fixed cost, very low variable cost, very high volume, globally scalable has been its business model all along.  Consumer businesses usually have moderate to high variable costs, so there is problem number one.  Only the top two players in a segment usually can achieve very high volume, so unless Microsoft achieves leadership early in a segment it never can get high enough volume to have a successful business model.  A head-on charge against the established leaders rarely works, and when it does it is a financial bloodbath.  So you may not need to be the first in market, but you need to be in early enough for the main land grab (or wait for the next paradigm shift to try again).  And global scaling of consumer offerings is way more difficult than for platforms or business-focused offerings.

Microsoft seems to have resolved to focus on its DNA.  It will be supportive, even encouraging, of third parties who want to use its platforms to offer consumer services, but avoid going after the consumer directly.  So you get a Cortana-enabled smart speaker from Harman Kardon, a high-end Cortana-enabled thermostat from Johnson Controls, a set of smart fixtures from Kohler that use Amazon’s Alexa for voice control but Microsoft Azure for the rest of their backend, and an agreement with Amazon for Cortana/Alexa integration.

Will Microsoft introduce consumer devices or services in the future?  Possibly, but they will suffer the same fate as its earlier attempts.  And I’m not throwing good money after bad (and I did throw a lot at every consumer thing Microsoft ever did).  I recognize that these attempts are at best trial balloons, and at worst ill-advised ventures by those intoxicated by the potential size of the market.  Microsoft is an arms supplier.  It should supply arms to companies going after the consumer, but avoid future attempts to fight consumer product wars itself.





Amazon moving off Oracle? #DBfreedom

A bunch of news stories, apparently coming off an article in The Information, are talking about Amazon and Salesforce attempting to move away from the use of Oracle.  I’m not going to comment specifically on Amazon, or Salesforce, and any attempt to move away from Oracle’s database.  But on that general topic.  And a little on Amazon (Web Services) in databases.

tl;dr It might not be possible to completely migrate off of the Oracle database, but lots of companies are capping their long term Oracle cost exposure.

There are a ton of efforts out there to make it easier for customers to move off of the Oracle database.  The entire PostgreSQL community has had making that possible as a key priority for many years.  There are PostgreSQL derivatives, like EnterpriseDB’s Postgres Advanced Server, that go much further than just providing an Oracle equivalent.  They target direct execution of ported applications by adding PL/SQL compatibility with its SPL, support for popular Oracle pre-supplied packages, an OCI connector, and other compatibility features.  Microsoft started a major push on migrating Oracle applications to SQL Server back in the mid-2000s with SQL Server Migration Assistant, and re-invigorated that effort last year.  IBM has a similar effort for DB2, which includes its own PL/SQL implementation.  And, of course, the most talked about effort the last few years is the one by AWS.  The AWS Database Migration Service (DMS) and Schema Conversion Tool (SCT) have allowed many applications to be moved off of Oracle to other databases, including to Aurora MySQL, Aurora PostgreSQL, and Redshift, which take advantage of the cloud to provide enterprise-level scalability and availability without the Oracle licensing tax.

Note that Andy isn’t specifically saying 50K migrations off of Oracle; that’s the total number for all sources and destinations.  But a bunch of them clearly have Oracle as the source, and something non-Oracle as the destination.

On the surface the move away from Oracle database is purely a balance between the cost of switching technologies and the cost of sticking with Oracle.  Or, maybe in rare cases, the difficulty achieving the right level of technological parity.  But that isn’t the real story of what it takes to move away from Oracle.

Sure, many apps can be manually moved over with a few hours or days of work.  Others can be moved pretty easily with the tooling provided by AWS or others, with days to weeks of work.  The occasional really complex app might take many person-months or person-years to move.  But if you have the source code, and you have (or can hire/contract) the expertise, you can move the applications.  And people do.  A CIO could look at spending, say, $5 million or $25 million or $100 million to port its bespoke apps and think they can’t afford it.  Or they could look at that amount and say “ah, but then I don’t have to write that big check to Oracle every year”.  So if you think long-term, and hate dealing with Oracle’s licensing practices (e.g., audits, reinterpreting terms when it suits them, inviting non-compliance then using it to force cloud adoption, etc.), then the cost to move your bespoke applications is readily justified.  So what are the real barriers to moving off Oracle database?

Barrier number one is 3rd party applications.  Sometimes these aren’t a barrier at all.  Using Tableau?  It works with multiple database engines, including Amazon Redshift, PostgreSQL, etc.  Using ArcGIS?  It just so happens that PostgreSQL with the PostGIS extension is one of the many engines it supports.  Using PeopleSoft?  Things just got a bit more difficult.  Because PeopleSoft supported other database systems when Oracle acquired it there are options, but they are all commercial engines (e.g., Informix, Sybase, and of course Microsoft SQL Server) and I don’t know how well Oracle is supporting them for new (re-)installations.  You can’t move to an open source, or open source compatible, engine.  Using Oracle E-Business Suite?  You’re screwed; you can’t use any database other than the Oracle database.  Given that Oracle has acquired so many applications over the years, there is a good chance your company is running on some Oracle-controlled application.  And they are taking no steps to have their applications support any new databases, not even the Oracle-owned MySQL.

Oracle’s ownership of both the database and key applications has created a near lock-in to the Oracle database.  I say “near” because you can in theory move to a non-Oracle application, and may do so over time.  But when you’ve lived through stories of companies spending $100s of millions to implement ERP and CRM solutions, the cost of swapping out E-Business Suite or Siebel makes it hard to consider.  Without that, there goes complete elimination of your Oracle database footprint.

Now on to the second issue, Oracle’s licensing practices.  I’m not an Oracle licensing expert, so I will apologize for the lack of details and potential misstatements.  But generally speaking, many (if not most) customers have licensed the Oracle database on terms that don’t really allow for a reduction in costs.  Let’s say you purchased licenses and support for 10,000 cores.  You are now only using 1000 cores.  Oracle won’t allow you to just purchase support for 1000 cores, if you want support you have to keep purchasing it for the total number of core licenses you own.  And since they only make security patches available under a support contract, it is very hard to run Oracle without purchasing support.  If you have an “all you can eat” type of agreement, to get out of it you end up counting all the core licenses you currently are using.  You can then stop paying the annual “all you can eat” price, but you still have to pay for support for all the licenses you had when you terminated the “all you can eat” arrangement.  Even if you are now only using 1 core of Oracle.
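To see the shape of the problem, here is a sketch of that scenario with illustrative numbers; the $47,500 per-core license price and 22% annual support rate are rough figures I am assuming for illustration, not terms from any specific contract:

```python
licensed_cores = 10_000        # cores you ever licensed (the scenario above)
cores_in_use = 1_000           # cores you actually still run Oracle on
license_price = 47_500         # assumed per-core license price, for illustration
support_rate = 0.22            # assumed annual support as a fraction of license price

# Support is owed on everything you ever licensed, not on what you use.
annual_support = licensed_cores * license_price * support_rate
print(f"${annual_support:,.0f}/year in support, regardless of the 90% you shut down")
print(f"${annual_support / cores_in_use:,.0f}/year per core actually in use")
```

The absolute numbers are made up; the point is the ratio: shrinking usage by 10x does nothing to the support bill, so the effective cost per remaining core grows 10x.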

To top it off, you can see how these two interact.  Even if just one third-party application keeps you using the Oracle database, you will be paying them support for every Oracle license you ever owned. Completely getting off Oracle requires a real belief that the short to mid-term pain is worth the long-term gain.

So does this “get off Oracle” thing sound hopeless?  NO.  For any healthy company, the number of cores being used grows year after year.  It doesn’t matter if you have an “all you can eat” agreement; all you are doing is committing yourself to an infinite life of high support costs.  What moving the moveable existing apps, and implementing new apps on open source/open source-compatible engines, allows you to do is stop growing the number of Oracle cores you license.  You move existing applications to PostgreSQL (or something else) to free up Oracle core licenses for applications that can’t easily be moved.  You use PostgreSQL for new applications, so they never need an Oracle core license.  You can’t eliminate Oracle, but you can cap your future cost exposure.  And then at some point you’ll find the Oracle core licenses represent a small enough part of your IT footprint that you’ll be able to make the final push to eliminate them.

Switching topics a little, one of the most annoying things about this is the claim in some of the articles that Amazon needs to build a new database.  Hello?  AWS has created DynamoDB, Redshift, Aurora MySQL, Aurora PostgreSQL, Neptune, and a host of other database technologies.  DynamoDB has roots in the NoSQL-defining Dynamo work, which predates any of this.  Amazon has a strong belief in NoSQL for certain kinds of systems, and that is reflected in the stats from last Amazon Prime Day: DynamoDB handled 3.4 trillion requests, peaking at 12.9 million per second.  For those applications that want relational, Aurora is a great target for OLTP and Redshift (plus Redshift Spectrum, when you want to divorce compute and storage) for data warehousing.  You think the non-AWS parts of Amazon aren’t taking advantage of those technologies as well?  Plus Athena, ElastiCache, RDS in general, etc.?  Puhleeze.


Service Level Agreements (SLA)

I wanted to make some comments on Service Level Agreements (SLAs), so we interrupt our scheduled Part 2 on 16TB Cloud Databases.  A Service Level Agreement establishes an expectation between a service provider and a customer of the level of service to be provided, and often a contractual commitment as well.  There are three ways to establish an SLA.  First, you can just pull it out of your a**.  Basically the customer says “I want an availability SLA of 99.9999999” and you say “Yes, Sir!”, even though that is impossible to deliver.  Maybe when it comes to contractual commitments you include so many exclusions that it becomes possible (e.g., “outages don’t count against availability calculations for SLA purposes” would be a good start).  Second, you can figure out what is theoretically possible based on your design.  But I’d prefer my SLAs be based on actual data, not just what math says should be possible.  So the third way is math plus data.  Even that turns out to be nuanced.  You can influence it both by the exclusions (e.g., “customer-caused outages don’t count” is a pretty obvious, and valid, one) and by what penalties you are willing to accept when you miss the SLA.

When you miss an SLA you are penalized in two ways.  Contractually there may be financial penalties, such as a 10% reduction in your bill, for missing the SLA.  An SLA will eventually be breached.  When you establish the SLA based on data and math, you know what the financial penalties of those breaches will be.  You can pick the SLA based on what level of financial cost you are willing to accept.  In other words, SLA breaches just become a cost of doing business.  What’s the difference between an SLA calling for 99.9%, 99.95%, 99.99%, or 99.999% uptime?  Just an increase in your cost of goods sold.
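Those uptime percentages translate into very different downtime budgets, which is where the cost difference comes from:

```python
MIN_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a (non-leap) year

for sla in (0.999, 0.9995, 0.9999, 0.99999):
    allowed = (1 - sla) * MIN_PER_YEAR
    print(f"{sla:.3%} -> {allowed:7.1f} minutes of downtime allowed per year")
```

Going from 99.9% to 99.999% shrinks the annual budget from roughly 8.8 hours to about 5 minutes, a 100x tightening that the provider ultimately has to pay for.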

The second penalty is reputational risk.  When you breach an SLA it causes harm to your reputation.  If a customer runs for years before having an SLA breach, that breach does little to damage your customer relationship, as long as you don’t breach the SLA again for a long time.  If you breach SLAs frequently, customers learn they can’t trust your service.  They may even seek alternatives.

Customers don’t even care about the financial penalties of an SLA breach.  Those are trivial compared to the cost of the breach to their business.  Meeting the SLA is what they really want; they see the financial penalty as an incentive for you to meet it.  The service provider’s accountants and lawyers will certainly want to make sure the business plan accommodates the SLA breaches, but as long as it does they will accept them.

A service provider willing to absorb a higher financial penalty from SLA breaches, and with a low concern for reputational risk, can set an SLA that they can’t consistently meet. A service provider with great concern for reputational risk will set an SLA they can consistently meet, even if it means that SLA is lower than its competitors.  The former favors the marketing advantage of a high SLA, the latter favors actual customer experience.

Which would you rather have, a service that claims 99.999% availability but only delivers 99.9%, or one that claims 99.99% availability and delivers 99.99%?  The 5 9s SLA sounds great, but it delivers 10x the downtime of the 4 9s SLA!  Do you want an SLA that your service provider almost always meets, or one that sounds, and is, too good to be true?
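The arithmetic behind the “nines” makes the gap obvious.  A quick sketch of the allowed downtime per year at each availability target:

```python
# Allowed downtime per year for a given availability target -- the
# arithmetic behind the "nines".

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):7.1f} min/year")

# 99.9%  -> 525.6 min/year (~8.8 hours)
# 99.99% ->  52.6 min/year (~53 minutes)
# 99.999% ->  5.3 min/year
```

The provider claiming five 9s but delivering three inflicts roughly ten times the downtime of one that claims, and delivers, four 9s.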

Personally I’ll take the consistent SLA, for two reasons.  First, because I can and will design around an SLA I can trust.  But one that is fictional will cause me to make bad decisions.  Second, because the service provider giving me an SLA that will reflect my actual experience is a service provider I can trust.

Bottom line, take SLAs with a large grain of salt.  Particularly when you can’t tell how often the SLA is breached.  More so if a service provider offers an SLA before having gained a significant amount of operational experience.  And if you can get a service provider to tell you how often they breach their SLA, more power to you.

Posted in AWS, Azure, Cloud, Computer and Internet, Google

16TB Cloud Databases (Part 1)

I could claim the purpose of this blog post is to talk about Amazon RDS increasing the storage per database instance to 16TB, and to some extent it is.  It’s also an opportunity to talk about the challenges of a hyperscale environment, not just for AWS but for Microsoft Azure, Google Cloud, and others as well.  I’ll start with the news, and since there is so much ground to cover I’ll break this into multiple parts.

As part of the (pre-)AWS re:Invent 2017 announcements, Amazon RDS launched support that increased the maximum database instance storage size from 6TB to 16TB for PostgreSQL, MySQL, MariaDB, and Oracle.  RDS for Microsoft SQL Server had launched 16TB database instances back in August, but with the usual RDS SQL Server restriction of the storage not being scalable.  That is, you had to pick 16TB at instance create time.  You couldn’t take a 4TB database instance and scale its storage up to 16TB.  Instead you would need to dump and load, or use the Native Backup/Restore functionality, to move databases to a new instance.  If the overall storage increase for RDS was lost in the noise of all the re:Invent announcements, the fact that you can now scale RDS SQL Server database instance storage was truly buried.  The increase to 16TB benefits a relatively small number of databases for a relatively small number of customers; the scalability of SQL Server database instance storage benefits nearly all current and future RDS SQL Server customers.

While RDS instances have been storage limited, Amazon Aurora MySQL has offered 64TB for years (and Aurora PostgreSQL also launched with 64TB support).  That is because Aurora was all about re-inventing database storage for the cloud, so it addressed the problems I’m going to talk about in its base architecture.  Non-Aurora RDS databases, Google’s Cloud SQL, Azure Database for MySQL (or PostgreSQL), and even Azure SQL Database (which, despite multiple name changes over the years, traces its lineage to the CloudDB effort that originated over a decade ago in the SQL Server group) have lived with the decades-old file- and volume-oriented storage architectures of on-premises databases.

Ignoring Aurora, cloud relational database storage sizes have always been significantly limited compared to their on-premises instantiations.  I’ll dive into more detail on that in Part 2, but let’s come up to speed on some history first.

Both Amazon RDS and Microsoft’s Azure SQL Database (then called SQL Azure) publicly debuted in 2009, but had considerably different origins.  Amazon RDS started life as a project by Amazon’s internal DBA/DBE community to capture their learnings and create an internal service that made it easy for Amazon teams to stand up and run highly available databases.  The effort was moved to the fledgling AWS organization, and re-targeted to helping external customers benefit from Amazon’s learnings on running large, highly available databases.  Since MySQL had become the most popular database engine (by unit volume), it was chosen as the first engine supported by the new Amazon Relational Database Service.  RDS initially had a database instance storage size limit of 1TB.  Now I’m not particularly familiar with MySQL usage in 2009, but based on MySQL’s history and capabilities in version 5.1 (the first supported by RDS), I imagine that 1TB covered 99.99% of MySQL usage.  RDS didn’t try to change the application model; indeed, the idea was that the application had no idea it was running against a managed database instance in the cloud.  It targeted lowering costs while increasing the robustness (reliability of backups, reliability of patching, democratization of high availability, etc.) of databases by automating what AWS likes to call the “undifferentiated heavy lifting” aspects of the DBA’s job.

As I mentioned, Azure SQL started life as a project called CloudDB (or Cloud DB).  The SQL Server team, or more precisely remnants of the previous WinFS team, wanted to understand how to operate a database in the cloud.  Keep in mind that Microsoft, outside of MSN, had almost no experience in operations.  They brought to the table the learnings and innovations from SQL Server and WinFS, and decided to take a forward-looking approach.  Dave Campbell and I had spent a lot of effort since the late 90s talking to customers about their web-scale application architectures, and were strong believers that application systems were being partitioned into separate services/microservices with separate databases, and that those databases were then being sharded for additional scalability.  So while in DW/Analytics multi-TB (or, in the Big Data era, PB) databases would be common, most OLTP databases would be measured in GB.  Dave took that belief into the CloudDB effort.  On the technology front, WinFS had shown it was much easier to build on top of SQL Server than to make deep internal changes.  Object-relational mapping (ORM) layers were becoming popular at the time, and Microsoft had done the Entity Framework as an ORM for SQL Server.  Another “research” project in the SQL Server team had been exploring how to charge by units of work rather than traditional licensing.  Putting this all together, the CloudDB effort didn’t go down the path of creating an environment for running existing SQL Server databases in the cloud.  It went down the path of creating a cloud-native database offering for a new generation of database applications.  Unfortunately customers weren’t ready for that, and proved resistant to some of the design decisions (e.g., Entity Framework was initially the only API offered) that Microsoft made.

That background is a little off topic, but hopefully useful.  The piece that is right on topic is Azure SQL storage.  With a philosophy that apps would use lots of modest-sized databases or shards (and understand sharding), charging by the unit of work (which enabled multi-tenancy as a way to reduce costs), routing built above a largely unchanged SQL Server engine, and no support for generic SQL (with its potential for cross-shard requests), Azure SQL launched with a maximum database size of 50GB.  This limit would prove a substantial pain point for customers, and a few years later it was increased to 150GB.  When I asked friends why the limit was still only 150GB they responded with “Backup.  It’s a backup problem.”  And therein lies the topic that will drive the discussion in Part 2.

I’ll close out by saying that relatively low cloud storage size limits are not unique to 2009, or to Amazon RDS and Azure SQL.  Google Cloud SQL Generation 1 (aka their original MySQL offering) was limited to 500GB databases.  The second generation, released this year for MySQL and in preview for PostgreSQL, allows 10TB (depending on machine type).  Azure SQL Database has struggled to increase storage size, but now maxes out at 4TB (depending on tier).  Microsoft’s Azure Database for MySQL and PostgreSQL is limited to 1TB in preview, though they mention it will support more at GA.  RDS has increased its storage size in increments: in 2013 it was increased to 3TB, and in 2015 to 6TB.  It is now 16TB or 64TB depending on engine.  Why?  Part 2 is going to be fun.

Posted in Amazon, AWS, Cloud, Computer and Internet, Database, Microsoft, SQL Server

Amazon Seattle Hiring Slowing?

An article in today’s Seattle Times discusses how Amazon’s open positions in Seattle are down by half from last summer, and at a recent low. I don’t know what is going on, but I will speculate on what could be one of the major factors. Let me start by covering a similar situation at Microsoft about 20 years ago. Microsoft had been in its hyper-growth phase, and teams would go in with headcount requests that were outrageous. Paul Maritz would look at a team’s hiring history and point out that to meet their new request they’d need to hire x people per month, but they’d never hired more than x/2 per month. So he’d give them headcount that equated to x/2+<a little>, and then he’d maintain a buffer in case teams exceeded their headcount allocation. Most teams would fail to meet their headcount goals, a few would exceed them, but Microsoft (at least Paul’s part) would always end up hiring less (usually way less) than the total headcount it had budgeted for. It worked for years, until one year came along where most teams were at or near their hiring goals and a couple of teams hired way over their allocated headcount. Microsoft had over-hired in total, and some teams were ahead of where they might have been even with the following year’s budget allocation. From then on there was pressure on teams to stay within their allocated headcount, both for the year overall and for the ramp-up that was built into the quarterly spending plans.

Could something similar be happening at Amazon? Could this be as simple as Amazon telling teams “no, when we said you can hire X in 2017 we meant X”, and enforcing that by not letting them post 2018 positions until 2018 actually starts? Amazon is always looking for mechanisms to use, rather than counting on good intentions, and having recruiting refuse to open positions that exceed a team’s current fiscal year’s headcount allocation would be a very solid mechanism for enforcing hiring limits.

It will be interesting to see if job postings start to grow again when the new fiscal year starts. That would be the most direct confirmation that this is nothing more than Amazon applying better hiring discipline on teams.

Posted in Amazon, Computer and Internet | 2 Comments