There is an observation I made back in the 1980s that both holds in today’s Cloud world and remains one of the toughest messages to communicate to senior leadership. When you ship a new product or service, a major release, or even a major feature (in the Cloud world), your people resources for new feature development are permanently cut in half. The short-term message is no more palatable, but perhaps easier to communicate: for the first 6-12 months after a major release, nearly 100% of your people will be unavailable (or their efforts severely degraded) for new feature development. So each budget cycle product teams end up asking for more staffing, even as it seems we are delivering less in the way of features. It isn’t that senior leaders don’t get that there is a tax on supporting existing products, from bug fixing to operations, but they do have trouble with the magnitude of it. For example, non-engineers (or those who haven’t done engineering recently) struggle with how costly yet necessary it is to pay down technical debt.
Two things happened to me in the 80s that led to my 50%/100% rule of thumb. The first was my experience as a project leader of multiple releases. With each release, the number of person-months I had to schedule for sustaining work went up: investigating customer issues, fixing bugs, cleaning up code that had become unmaintainable, dealing with dependencies (e.g., a new OS version breaking an existing product), revamping build systems, responding to corporate initiatives (e.g., you must switch to this new setup/installation system), etc. Over time I realized it would stabilize at about half the team.
The other thing that happened in the 80s is that I went back and looked at multiple releases, including those I hadn’t been involved in, and plotted the incoming Software Performance Report (SPR) rate by month against a number of other metrics. SPRs were a means for DEC customers to report bugs, request features, and otherwise communicate with the engineering team about issues. There was no filter on these; even customers without support contracts could submit SPRs, so a complex feature might generate a lot of SPRs even though those resulted in a low unique bug rate. There were two interesting data points here. The first was that the incoming SPR rate started to rise dramatically about 60 days after release, with the peak occurring around the 6-month mark. While the incoming rate then dropped off, it plateaued at a higher level after each release. There were two causes for that, one being simply that there were more features that needed support. The other was that, thankfully, there was a rapidly growing customer base. So even if you drove SPRs per Customer (one of my favorite overall product quality metrics) down, the growth in customers meant more SPRs.
The second data point was that there was a clear correlation between the number of check-ins for the release and the incoming SPR rate, so major releases, not surprisingly, resulted in more SPRs than minor releases. Based on this metric, I was actually able to predict that the SPR rate for a new major release would be terrifyingly high, a prediction that sadly proved accurate. At peak, nearly the entire development team was required to respond to SPRs, and for about 90 days before and after the peak there was a high interrupt load on most developers as SPRs hit for their area, rendering them unproductive at working on new features.
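The point about SPRs per Customer falling while total SPRs still rise is just arithmetic, and a tiny sketch makes it concrete. The function name and every number below are made up for illustration; none of them come from actual DEC data.

```python
def total_sprs(customers: int, sprs_per_customer: float) -> float:
    """Total incoming SPRs for a period, given customer count and per-customer rate."""
    return customers * sprs_per_customer


# Hypothetical figures, for illustration only (not real DEC numbers).
# The per-customer rate improves by 20%, but the customer base doubles.
before = total_sprs(customers=1000, sprs_per_customer=0.10)
after = total_sprs(customers=2000, sprs_per_customer=0.08)

print(f"before: {before:.0f} SPRs, after: {after:.0f} SPRs")
```

With these invented inputs the quality metric gets better while the support load on the team still grows by more than half, which is the effect described above.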
The Cloud changes none of this, and perhaps makes it even worse. Before you enter a beta or preview period you have no operational burden, minimal deployment burden, only modest urgency on fixing most bugs, etc. The preview is as much about making sure you can operate at hyperscale as it is about traditional beta things like verifying that customers can use the service as intended. Then the day you declare General Availability (GA) you have a 24×7 operational burden. Production-impacting bugs become urgent. It’s the day you start learning where you missed on preparing for hyperscale (see https://hal2020.com/2018/01/20/challenges-of-hyperscale-computing-part-2/ and https://hal2020.com/2018/08/25/challenges-of-hyperscale-computing-part-3/). It’s the day customers start trying to do things you never intended, or perhaps never expected. It’s the day that you start having to plan on paying down technical debt built up during development. It’s the day you have to start dealing with disruptions like the Meltdown and Spectre security issues with an urgency that distracts from feature work. Etc. So just like with a 1980s packaged product, for the first 6-12 months nearly the entire team will be unavailable for feature work, and on an ongoing basis only half the team you had at launch will be available for feature work.
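For anyone who wants to put the rule of thumb in front of a budget reviewer, here is a rough sketch of it as a function. The name `feature_headcount`, the 9-month default (simply a midpoint of the 6-12 month range), and the choice to treat "nearly 100%" as exactly 100% are all my simplifying assumptions, not measured values.

```python
def feature_headcount(launch_team: int, months_since_ga: int,
                      crunch_months: int = 9,
                      crunch_tax: float = 1.0,
                      steady_tax: float = 0.5) -> float:
    """People available for new feature work under the 50%/100% rule of thumb.

    crunch_months: assumed length of the post-release crunch (source says 6-12).
    crunch_tax:    fraction of the team consumed during the crunch (assumed 1.0).
    steady_tax:    fraction permanently consumed afterwards (0.5 per the rule).
    """
    if months_since_ga < 0:
        raise ValueError("months_since_ga must be non-negative")
    tax = crunch_tax if months_since_ga < crunch_months else steady_tax
    return launch_team * (1.0 - tax)


# Hypothetical 20-person launch team, sampled at a few points after GA.
for month in (0, 6, 12, 24):
    print(month, feature_headcount(20, month))
```

Under these assumptions, a 20-person team has nobody free for features during the crunch and 10 people free from then on, unless headcount is added.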
I tried for years to find ways to avoid the 50%/100% tax, but never succeeded. So each budget cycle I’d look at all we wanted to do, all that our customers wanted us to do, and go and ask for a significant headcount increase. Each year I would face the pain of telling senior leadership how little feature work we could do without that increase. Each year they would challenge me, and I didn’t blame them. I never found a way to communicate the magnitude of the situation in the context of the budgeting exercise. In retrospect I realize that what I should have done at Amazon is write a narrative, outside the “OP1” process, that made all this clear. I could have looked at data for numerous projects that would have (likely) supported my career-long observation. But that would have been too late to help with the decades at DEC and Microsoft where I failed to fully explain the need for the additional people. To be clear, I just about always got the people I needed. It was just more painful than it should have been.
So what prompted me to write this now? I’m watching as the first signs appear that Aurora PostgreSQL is getting past its “V1.0” 100% stage. For example, although Aurora PostgreSQL has not yet announced PostgreSQL 10 support, in some regions you can actually find it (10.4 specifically) in the version selector for creating Aurora PostgreSQL instances. Launch must be fairly imminent, with hopefully many more features coming in the next few months. Overall though, it reminded me that my 50%/100% rule still applies.