My Mobile Phone is Sacrosanct

Sorry for my absence the last few weeks; I’ve been rather busy with a move.  I’ll try to get back to regular blogging, and I have a huge backlog of topics.  Here is a short one: the importance of my mobile phone has reached the level where I am reluctant to take risks with it.  And that is causing problems.

I recently decided not to enable my cell phone to connect to a client’s email system. Like most organizations, my client’s IT organization requires any device accessing its email system to submit to its Mobile Device Management (MDM) regime. For the most part that is not a problem, as I already manage my phone that way: for example, requiring a PIN to unlock it, and having the device set to erase itself after a number of failed PIN entries. But the usual MDM regime has one “feature” I can no longer tolerate: the ability for the organization to erase the contents of your mobile device at its discretion. And, in particular, at termination of “employment”. If I were a full-time employee, expecting to retain that status for an indefinite (i.e., multi-year) period, that might not be such a big thing. But as a consultant my access to the client’s email system might not last beyond a few months, or could even last just a few weeks. Then my phone would be wiped.

Up until recently I didn’t really care about wiping my phone, because everything really lives in the cloud.  Or so I used to think.  I would regularly switch devices, and all my important data, emails, etc. would be available on the new device.  Thank you OneDrive, OneNote, Cloud Drive, Exchange, iCloud, etc.  But increasingly there is something critical that is local only: two-factor authentication (2FA).  My phone has become my identity.

My phone has been used as a 2FA device for a long time, with many sites texting me a code I had to enter for login (or to authorize certain actions).  If that were the extent of it, then wiping the device wouldn’t really be a problem, since the phone and SIM retain the phone’s physical identity.  But recently more and more sites are depending on authentication apps that run on the device and maintain local state: for example, Microsoft’s Authenticator, Google Authenticator, MobilePass+, etc.  Lose one of those apps and re-acquiring access to the sites that were being protected is a nightmare.
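Those apps are generally implementations of time-based one-time passwords (TOTP, RFC 6238): a secret is provisioned once, stored only on the device, and hashed with the current time to produce each code.  A minimal sketch using only Python’s standard library (the secret shown is the RFC 6238 test value, not a real one) makes it clear why the codes are unrecoverable once the device is wiped — the secret never leaves the phone:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, for_time=None, digits=6, step=30):
    """RFC 6238 TOTP: derive a one-time code from a device-local shared secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if for_time is None else for_time) // step)
    msg = struct.pack(">Q", counter)                 # 8-byte big-endian time counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test secret ("12345678901234567890" in base32); at T=59 the
# 8-digit SHA-1 test vector is 94287082.
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", for_time=59, digits=8))
```

Wipe the phone and the `key` is gone; the server still expects codes derived from it, which is exactly why recovery is so painful.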

Not long ago I accidentally deleted an authentication app and discovered it would take at least 24 hours to re-acquire access to the account it protected.  Basically the site’s recovery process was to insert a 24-hour delay between the request to turn off 2FA and its taking effect.  This was done in the name of security.  Then you had a few hours to access the site with a temporary code, before that code became invalid.  Then you had to request a new code, which came 24 hours later, and so on.  I was always busy when that code appeared, so it took days to regain access.  Yeah, this is an extreme example.  But not the only one.  Since the purpose of 2FA is to provide very strong access control, recovery from loss of a 2FA device is almost always intentionally very difficult.

I was about to make the final tap on my phone to add the client’s email system when the impact of having my phone wiped hit home.  I would immediately lose access to most of my life.  My personal email, my bank accounts, even Twitter.  Losing access to my email would be the worst, because the recovery processes for most things go through email.  It would take me days of effort to put my digital life back together.  The process would spin further out of control if I didn’t have other devices with me, or if they too were wiped.  For example, if my iPad were wiped at the same time for the same reason.  I’d be living a dystopian nightmare.  I cancelled connecting my phone to their email system.

This is all starting to have a negative impact, something that will only grow as our phones become more a part of our identity.  I’ve missed time-sensitive emails from the client because I either need to log in with OWA (which needs 2FA, of course) or use my iPad (which I did connect to their email system).  I have become reluctant to upgrade my phone, because that creates the same situation.  I’d have to pre-plan the upgrade, turning off 2FA where possible and scheduling time to go through the replacement process where it isn’t.  I’ve even turned off the auto-wipe feature, because the impact of someone wiping out my identity is now greater than the likelihood that they can break into the phone before I do my own remote wipe (or otherwise disable the phone’s access to my resources).

I know I’m going to hear from people that they use solutions like carrying two phones with them, one for work and one for personal use.  That doesn’t work for me, and only addresses the catalyst for this post rather than the core issue.  A better solution for the work/personal data problem is for efforts to compartmentalize work data on a personal device to become ubiquitous.  Your employer would never have, nor need, the right to wipe your entire device but rather have a way to wipe just their data.  But that doesn’t go far enough.

Are there mechanisms to get around the loss of a 2FA device?  Sure.  My Twitter backup codes are sitting in a safe 2000 miles from where I’m writing this.  Not too useful a mechanism.  Well, why not store them online somewhere?  OK, in the case of just losing 2FA access to Twitter that would work.  But in the case of my phone being wiped, I would lose access to the store I had them in.  Put them in a store that doesn’t require 2FA?  Umm, remind me why we are doing 2FA to begin with?

Authy, an authentication app that has multi-device support and secure cloud backup, is probably the best current approach, to the extent that it can be used to replace the other authentication apps.  But it can’t always (e.g., I don’t think it can replace MobilePass+, which is often used for enterprise network access).  It also isn’t clear that Authy, or a similar 3rd party HOTP/TOTP app, will play a part in future authentication mechanisms.  As Microsoft, for example, moves away from the use of passwords, its solution may require the Microsoft Authenticator app rather than allow for Google Authenticator, Authy, etc. as alternatives.

As we continue the rapid move to our phones being our identities, every identity provider needs to provide a more robust way to recover from the loss of a phone.  But for now, I’m treating my phone as sacrosanct.  No, you can’t have permission to erase its contents.  And no, I’m no longer upgrading my phone frequently.


Challenges of Hyperscale Computing (Part 2)

In part one of this series I used recent increases in maximum database size as a driver for introducing the challenges of hyperscale computing.  In this part we dive into the heart of the matter, which is what it takes to operate at hyperscale.  Where hyperscale computing begins is an academic question, and the lessons here can be applied to modest numbers of computer systems as well as huge numbers.  The difference is that with modest numbers you have choices, with huge numbers (as you shall see) you really don’t.  For our purposes we will assume hyperscale means at least 100s of thousands of “systems”, and will use 1 Million Virtual Machines (instances or virts) as a good order of magnitude for illustration.  To put this in context, AWS has millions of customers and they each have at least one, and probably many,  instances.  Even when a customer is using something that is “serverless”, there are instances behind the scenes.  So rather than being far-fetched, 1 Million is a good order of magnitude to focus on.

Say you are a DBA dedicated to the care and feeding of an important database.  Nightly backups of that database fail (meaning they need human intervention) 1 in 1000 times, so you get paged about a failed backup once every three years.  You sleep well.  Or you are responsible for 100 databases.  With a 1 in 1000 failure rate you are being paged every 10 days.  Still not too bad.  How about 1000 databases?  Now you are being paged for a failure every day, 365 days per year.  This is starting to not be any fun.  How well do you sleep knowing that at some point during the night your pager will go off and you will have to work for minutes to hours?  At this point one “primary” responder (be that a DBA, Systems Engineer, SDE, etc.) isn’t even possible; you need at least two so someone is always available to deal with failures.  Really you need at least three, and by some calculations four to five (when you factor in vacations, health issues, turnover, etc.).
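The paging arithmetic is worth writing down: at a fixed per-backup failure rate, the expected interval between pages shrinks linearly with fleet size.

```python
def days_between_pages(n_databases, failure_rate=1 / 1000):
    """Expected days between failed nightly backups for a fleet of databases,
    assuming one backup per database per night."""
    return 1 / (n_databases * failure_rate)

for n in (1, 100, 1000):
    print(n, days_between_pages(n))   # 1000 days (~3 years), 10 days, 1 day
```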

How about 1 million database instances?  At our 1 in 1000 failure rate you need to handle 1000 failures per day!  This turns into an army of people doing nothing but responding to backup failures.  How big of an army?  Let’s say a backup failure can be resolved in 15 minutes, so one person can handle 4 failures an hour.  They handle failures 7 hours (assuming 1 for lunch, breaks, etc.) a shift, so 28 failures each.  That translates to 36 people dedicated to handling backup failures each and every day.  To achieve that you would need an overall team size of between 108 and 180.
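A quick sanity check on those staffing numbers (the 3x–5x multiplier for round-the-clock coverage, vacations, and turnover is the rule of thumb from the paragraph above):

```python
import math

failures_per_day = 1_000_000 // 1000        # 1M instances, 1-in-1000 nightly failure rate
per_person_per_day = (7 * 60) // 15         # 7 working hours/shift, 15 minutes per failure
daily_responders = math.ceil(failures_per_day / per_person_per_day)
print(daily_responders)                      # 36 people handling failures every single day
print(daily_responders * 3, daily_responders * 5)   # 108 to 180 total headcount
```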

Is a team of 180 people to handle backup failures practical?  Is it cost-effective?  Does anyone really want to burden their cost structure with all these people?  Your organization wouldn’t let you hire them.  Your public cloud provider is going to have to include their costs in its pricing, so you will be paying for them.  Can you really hire and maintain large numbers of people willing and able to do this work?  It’s a real challenge.

A quick example of the cost issue.  An Amazon RDS MySQL t2.micro instance you are paying for on a 3 Year All-Upfront Reserved Instance basis (i.e., the lowest public price) costs 18.5 CENTS PER DAY.  So AWS grosses $185 a day for 1000 instances.  Doing a back of the envelope calculation let me postulate the fully burdened cost of resolving the 1 failed backup a day for those 1000 instances is $90.  That leaves $95 a day to cover all hardware and infrastructure costs, other failure conditions, cost of sales, software development, etc.  In other words, it’s a huge money losing proposition.  And that doesn’t even take into account the cost hit on the many t2.micros being used as part of the AWS Free Tier.

So what makes more sense as a tolerable failure rate for backups at hyperscale?  To get back to the point where someone is paged once per day you’d need a failure rate of 1 in a million.  Would it be reasonable at the million (or low millions) of instances to have a team of 3-5 people who handled failures?  Perhaps.  But the story doesn’t end there.

Let’s talk about log backup failures.  Databases offer Point-In-Time Recovery (PITR), and if you want that recovery point to be within 5 minutes, you need to back up the log files at least that often.  That’s 12 times per hour.  So at 1 million instances you are doing 12 million log backups per hour.  Yup, nearly 300 million operations per day!  So even at a 1 in a million failure rate, you would still be seeing almost 300 failures a day that needed a human being to step in.  And we haven’t even begun discussing anything other than backup!  This suggests that our target failure rate should not be 1 in a million, but rather 1 in a billion.
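Here is that arithmetic, sketched for both a 5-minute and a 3-minute log backup interval; either way you end up with hundreds of daily human interventions even at a 1-in-a-million failure rate:

```python
instances = 1_000_000

for interval_min in (5, 3):
    per_hour = 60 // interval_min                 # log backups per instance per hour
    daily_ops = instances * per_hour * 24         # fleet-wide log backups per day
    print(f"{interval_min}-min interval: {daily_ops:,} backups/day, "
          f"{daily_ops / 1_000_000:.0f} failures/day at 1-in-a-million")
```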

Of course, if we are already talking about a million instances, and we all know how fast the cloud is growing, then we are looking at where the puck is now while we should be focused on where the puck is going.  We probably should be thinking about tens of millions of instances, and targeting failure rates of 1 in 10 billion, 1 in 100 billion, or even 1 in a trillion operations.

Earlier I made an assumption that a backup failure could be resolved in 15 minutes.  There are a lot of assumptions built into that number.  While I’m sure every DBA has had the experience of looking at an issue, immediately recognizing the problem, and running a script to resolve it, they have also had the experience of spending hours or days resolving and cleaning up after a failure.  We’ve known since the 80s that computer failures are largely the result of human error, and have been working ever since to address that.  So not only do you have to target failure rates of 1 in billions, you have to target reducing the cost, and the potential for negative impact, when human beings do have to get involved.  And you need to do this in the context of very high security, availability, and durability goals.

I am using databases as an example to drive the discussion, but all of this applies to any area of hyperscale computing.  At re:Invent 2017 AWS’ CISO, Stephen Schmidt, strongly made the point that AWS does not have a Security Operations Center.  He talked some about how this is achieved, and Distinguished Engineer Eric Brandwine offered a deeper look.  I wonder how low a failure rate they had to achieve to make it possible to eliminate the SOC?

In the next segment of this series I’ll dive into how the need to both achieve very low failure rates, and make resolution of those failures fast and foolproof, comes through in public cloud database offerings.  That will cover some generic design patterns, but also deal specifically with the behaviors and feature sets of managed database services.




Microsoft “can’t win for losing”

When it comes to the consumer, Microsoft’s history can best be described as “I got it. I got it. I got it. <THUMP> I ain’t got it.”.  Today is the 4th anniversary of my Xbox: Fail blog post, and this week Microsoft put the final nail in the coffin of Kinect.  So it really is an appropriate point to talk about Microsoft and the consumer.  Microsoft is not a consumer-focused company, and never will be despite many attempts over the decades.  Recognition of this reality, and an end to tilting at windmills, is one of the things that Satya Nadella seems to have brought to the table.

First let’s get something out of the way: we need to refine what we mean by the label “consumer”.  It isn’t simply the opposite of business/organizational users.  Microsoft has always done just fine in providing individuals with personal productivity and content creation tools.  The Windows-based PC remains at the center of any complex activity.  Sure, I book some flights on my iPhone or iPad.  But when I start putting together a complex multi-leg trip the PC becomes my main tool.  Office has done well with consumers, and continues to do so in spite of popular free tools from Google.  And over the last few years Microsoft has gained traction with the artistic/design crowd that had always gravitated towards the Mac.  So when we talk about the consumer we really are talking about experiences that are left of center on the content consumption to content creation spectrum.  Microsoft will always be a strong player in right-of-center content creation experiences, be it for individuals, families, or organizations.  But, other than console gaming, they aren’t going to be a significant player in left-of-center experiences.  And Microsoft fans are going to go crazy over that.

The end of life for Kinect is the perfect illustration of Microsoft’s inability to be a consumer player.  The Xbox One with (then mandatory) Kinect was introduced a year before the Amazon Fire TV and a year and a half before the Amazon Echo.  It was originally tasked with becoming the center of home entertainment, and offered a voice interface.  Go read my Xbox: Fail piece for how it wasn’t ready to live up to that design center.  It’s pretty typical Microsoft V1 stuff.  Unfortunately the Xbox One was also V1 from a console gaming perspective, so Microsoft focused on making it more competitive in that niche and abandoned pushing forward on the home entertainment side.  Imagine that: Microsoft had a beachhead of 10s of millions of voice-enabled devices in place before Amazon even hinted at the Echo, and failed to capitalize on it.  You can repeat that story many times over the last 25 years.

It isn’t that Xbox One was the perfect device for the coming voice assistant, or streaming TV, revolutions.  The need to be a great gaming console gave it much too high a price point for non-gamers.  But Microsoft could have continued to evolve both the experience and produced lower priced, non-gaming focused, hardware.  Contrast what Microsoft did with what Amazon did around the Echo.  When the Echo was introduced it was considered a curiosity, a niche voice-operated speaker for playing music.  When Amazon started to gain traction with the Echo and Alexa, they went all in, and as a result have a strong lead in today’s hottest segment of the consumer technology space.  It reminded me a lot of Microsoft’s pivot to the Internet back in 1995.  But in the Xbox One case, Microsoft had the vision (at least in general direction), but failed to capitalize on it.  Failed to even make a serious attempt.  Now, at best, it could fight it out for a distant 4th or 5th place in voice assistants and home entertainment.  This consumer stuff just isn’t in Microsoft’s DNA.

The death of the Groove Music Service is another example, and maybe more telling on why Microsoft hasn’t been able to crack the code on the consumer.  Groove is just the latest name for Zune’s music service.  When MP3 players became popular, Microsoft jumped on the bandwagon in keeping with its DNA: it relied on 3rd parties that it supplied with technology (e.g., DRM).  When that didn’t even turn out to be a speedbump on the iPod’s adoption, it finally introduced the Zune as a first party device.  To have as good an experience as an iPod, the Zune needed an iTunes equivalent, and what we now know as the Groove Music Service was born.  Despite the jokes its failure spawned, the Zune was quite a nice device.  But since it couldn’t play the music you’d acquired with iTunes, there really was no iPod to Zune migration path.  By the time Zune came on the market the game was already over.  As Zune died, other consumer-focused device efforts came to the fore (Kin, Windows Phone 7, Xbox One) and the music service lived on.  But since the devices never gained traction, neither did the music service.  And for Microsoft the music service was never a player on its own; it was just a necessary evil to support its consumer device experience.  With that mindset, the failure to gain traction with consumer devices meant Groove was superfluous.  Sure, Groove could have owned the segments that Spotify and Pandora now dominate, but that was never what Microsoft was going for.  And now, it is too late.

Being a content creator or distributor is not in Microsoft’s DNA.  It has an immune system that rejects it time and time again.  Microsoft made a big play on consumer titles in the early to mid 90s, remember Microsoft Dogs and Encarta?  Offerings like these are very manpower intensive because they need a lot of content production, editing, frequent updating, sell for very little, are expensive to localize, and often don’t even make sense globally.  So Microsoft concluded they didn’t fit well with its business model and backed away from all but a few major titles such as Encarta.  While Encarta was great for its time, the Internet left it competing with Wikipedia.  That destroyed what little economic value Encarta had.  Other content-oriented efforts, such as Slate, were disposed of to save costs when the Internet Bubble burst.  The MSNBC joint venture was allowed to dissolve when its contract came up for renewal.  And so on.

I could even say that great end user experiences are not in Microsoft’s DNA, though that one is more debatable.  Usually it is thought of as being consistently second to Apple.  So rather than saying they aren’t in Microsoft’s DNA, I’d say that Microsoft user experiences are almost always compromised by more dominant aspects of its DNA.  And that keeps it from being a great consumer experience company.

What is Microsoft good at?  Creating platforms that others build on.  Doing work that is technically hard, and takes a lot of engineering effort, that it can sell over and over again.  High fixed cost, very low variable cost, very high volume, globally scalable has been its business model all along.  Consumer businesses usually have moderate to high variable costs, so there is problem number one.  Only the top two players in a segment usually can achieve very high volume, so unless Microsoft achieves leadership early in a segment it never can get high enough volume to have a successful business model.  A head-on charge against the established leaders rarely works, and when it does it is a financial bloodbath.  So you may not need to be the first in market, but you need to be in early enough for the main land grab (or wait for the next paradigm shift to try again).  And global scaling of consumer offerings is way more difficult than for platforms or business-focused offerings.

Microsoft seems to have resolved to focus on its DNA.  It will be supportive, even encouraging, of third parties who want to use its platforms to offer consumer services, but avoid going after the consumer directly.  So you get a Cortana-enabled smart speaker from Harman Kardon, a high-end Cortana-enabled thermostat from Johnson Controls, a set of smart fixtures from Kohler that use Amazon’s Alexa for voice control but Microsoft Azure for the rest of their backend, and an agreement with Amazon for Cortana/Alexa integration.

Will Microsoft introduce consumer devices or services in the future?  Possibly, but they will suffer the same fate as its earlier attempts.  And I’m not throwing good money after bad (and I did throw a lot at every consumer thing Microsoft ever did).  I recognize that these attempts are at best trial balloons, and at worst ill-advised ventures by those intoxicated by the potential size of the market.  Microsoft is an arms supplier.  It should supply arms to companies going after the consumer, but avoid future attempts to fight consumer product wars itself.





Amazon moving off Oracle? #DBfreedom

A bunch of news stories, apparently coming off an article in The Information, are talking about Amazon and Salesforce attempting to move away from the use of Oracle.  I’m not going to comment specifically on Amazon, or Salesforce, and any attempt to move away from Oracle’s database.  But on that general topic.  And a little on Amazon (Web Services) in databases.

tl;dr It might not be possible to completely migrate off of the Oracle database, but lots of companies are capping their long term Oracle cost exposure.

There are a ton of efforts out there to make it easier for customers to move off of the Oracle database.  The entire PostgreSQL community has had making that possible as a key priority for many years.  There are PostgreSQL derivatives, like EnterpriseDB’s Postgres Advanced Server, that go much further than just providing an Oracle equivalent.  They target direct execution of ported applications by adding PL/SQL compatibility with its SPL, support for popular Oracle pre-supplied packages, an OCI connector, and other compatibility features.  Microsoft started a major push on migrating Oracle applications to SQL Server back in the mid-2000s with SQL Server Migration Assistant, and re-invigorated that effort last year.  IBM has a similar effort for DB2, which includes its own PL/SQL implementation.  And, of course, the most talked about effort the last few years is the one by AWS.  The AWS Database Migration Service (DMS) and Schema Conversion Tool (SCT) have allowed many applications to be moved off of Oracle to other databases, including to Aurora MySQL, Aurora PostgreSQL, and Redshift, which take advantage of the cloud to provide enterprise-level scalability and availability without the Oracle licensing tax.

Note that Andy isn’t specifically saying 50K migrations off of Oracle; that’s the total number for all sources and destinations.  But a bunch of them clearly have Oracle as the source, and something non-Oracle as the destination.

On the surface the move away from Oracle database is purely a balance between the cost of switching technologies and the cost of sticking with Oracle.  Or, maybe in rare cases, the difficulty achieving the right level of technological parity.  But that isn’t the real story of what it takes to move away from Oracle.

Sure, many apps can be manually moved over with a few hours or days of work.  Others can be moved pretty easily with the tooling provided by AWS or others, with days to weeks of work.  The occasional really complex app might take many person-months or person-years to move.  But if you have the source code, and you have (or can hire/contract) the expertise, you can move the applications.  And people do.  A CIO could look at spending, say, $5 million or $25 million or $100 million to port its bespoke apps and think they can’t afford it.  Or they could look at that amount and say “ah, but then I don’t have to write that big check to Oracle every year”.  So if you think long-term, and hate dealing with Oracle’s licensing practices (e.g., audits, reinterpreting terms when it suits them, inviting non-compliance then using it to force cloud adoption, etc.), then the cost to move your bespoke applications is readily justified.  So what are the real barriers to moving off Oracle database?

Barrier number one is 3rd party applications.  Sometimes these aren’t a barrier at all.  Using Tableau?  It works with multiple database engines, including Amazon Redshift, PostgreSQL, etc.  Using ArcGIS?  It just so happens that PostgreSQL with the PostGIS extension is one of the many engines it supports.  Using PeopleSoft?  Things just got a bit more difficult.  Because PeopleSoft supported other database systems when Oracle acquired it there are options, but they are all commercial engines (e.g., Informix, Sybase, and of course Microsoft SQL Server) and I don’t know how well Oracle is supporting them for new (re-)installations.  You can’t move to an open source, or open source compatible, engine.  Using Oracle E-Business Suite?  You’re screwed; you can’t use any database other than the Oracle database.  Given that Oracle has acquired so many applications over the years, there is a good chance your company is running on some Oracle-controlled application.  And they are taking no steps to have their applications support any new databases, not even the Oracle-owned MySQL.

Oracle’s ownership of both the database and key applications has created a near lock-in to the Oracle database.  I say “near” because you can in theory move to a non-Oracle application, and may do so over time.  But when you’ve lived through stories of companies spending $100s of millions to implement ERP and CRM solutions, the cost of swapping out E-Business Suite or Siebel makes it hard to consider.  Without that, there goes complete elimination of your Oracle database footprint.

Now on to the second issue, Oracle’s licensing practices.  I’m not an Oracle licensing expert, so I will apologize for the lack of details and potential misstatements.  But generally speaking, many (if not most) customers have licensed the Oracle database on terms that don’t really allow for a reduction in costs.  Let’s say you purchased licenses and support for 10,000 cores.  You are now only using 1000 cores.  Oracle won’t allow you to just purchase support for 1000 cores, if you want support you have to keep purchasing it for the total number of core licenses you own.  And since they only make security patches available under a support contract, it is very hard to run Oracle without purchasing support.  If you have an “all you can eat” type of agreement, to get out of it you end up counting all the core licenses you currently are using.  You can then stop paying the annual “all you can eat” price, but you still have to pay for support for all the licenses you had when you terminated the “all you can eat” arrangement.  Even if you are now only using 1 core of Oracle.
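To see the shape of the problem, here is a sketch of that scenario with illustrative numbers; the $47,500 per-core license price and 22% annual support rate are rough figures I am assuming for illustration, not terms from any specific contract:

```python
licensed_cores = 10_000        # cores you ever licensed (the scenario above)
cores_in_use = 1_000           # cores you actually still run Oracle on
license_price = 47_500         # assumed per-core license price, for illustration
support_rate = 0.22            # assumed annual support as a fraction of license price

# Support is owed on everything you ever licensed, not on what you use.
annual_support = licensed_cores * license_price * support_rate
print(f"${annual_support:,.0f}/year in support, regardless of the 90% you shut down")
print(f"${annual_support / cores_in_use:,.0f}/year per core actually in use")
```

The absolute numbers are made up; the point is the ratio: shrinking usage by 10x does nothing to the support bill, so the effective cost per remaining core grows 10x.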

To top it off, you can see how these two interact.  Even if just one third-party application keeps you using the Oracle database, you will be paying them support for every Oracle license you ever owned. Completely getting off Oracle requires a real belief that the short to mid-term pain is worth the long-term gain.

So does this “get off Oracle” thing sound hopeless?  NO.  For any healthy company, the number of cores being used grows year after year.  It doesn’t matter if you have an “all you can eat” agreement; all you are doing is committing yourself to an infinite life of high support costs.  What moving the moveable existing apps, and implementing new apps on open source/open source-compatible engines, allows you to do is stop growing the number of Oracle cores you license.  You move existing applications to PostgreSQL (or something else) to free up Oracle core licenses for applications that can’t easily be moved.  You use PostgreSQL for new applications, so they never need an Oracle core license.  You can’t eliminate Oracle, but you can cap your future cost exposure.  And then at some point you’ll find the Oracle core licenses represent a small enough part of your IT footprint that you’ll be able to make the final push to eliminate them.

Switching topics a little, one of the most annoying things about this is the claim in some of the articles that Amazon needs to build a new database.  Hello?  AWS has created DynamoDB, Redshift, Aurora MySQL, Aurora PostgreSQL, Neptune, and a host of other database technologies.  DynamoDB has roots in the NoSQL-defining Dynamo work, which predates any of this.  Amazon has a strong belief in NoSQL for certain kinds of systems, and that is reflected in the stats from last Amazon Prime Day: DynamoDB handled 3.4 trillion requests, peaking at 12.9 million per second.  For those applications that want relational, Aurora is a great target for OLTP and Redshift (plus Redshift Spectrum, when you want to divorce compute and storage) for data warehousing.  You think the non-AWS parts of Amazon aren’t taking advantage of those technologies as well?  Plus Athena, ElastiCache, RDS in general, etc.?  Puhleeze.


Service Level Agreements (SLA)

I wanted to make some comments on Service Level Agreements (SLAs), so we interrupt our scheduled Part 2 on 16TB Cloud Databases.  A Service Level Agreement establishes an expectation between a service provider and a customer of the level of service to be provided, and often a contractual commitment as well.  There are three ways to establish an SLA.  First, you can just pull it out of your a**.  Basically the customer says “I want an availability SLA of 99.9999999” and you say “Yes, Sir!”, even though that is impossible to deliver.  Maybe when it comes to contractual commitments you include so many exclusions that it becomes possible (e.g., “outages don’t count against availability calculations for SLA purposes” would be a good start).  Second, you can figure out what is theoretically possible based on your design.  But I’d prefer my SLAs be based on actual data, not just what math says should be possible.  So the third way is math plus data.  Even that turns out to be nuanced.  You can influence it both by the exclusions (e.g., “customer-caused outages don’t count” is a pretty obvious, and valid, one) and by what penalties you are willing to accept when you miss the SLA.

When you miss an SLA you are penalized in two ways.  Contractually there may be financial penalties, such as a 10% reduction in your bill, for missing the SLA.  An SLA will eventually be breached.  When you establish the SLA based on data and math, you know what the financial penalties of those breaches will be.  You can pick the SLA based on what level of financial cost you are willing to accept.  In other words, SLA breaches just become a cost of doing business.  What’s the difference between an SLA calling for 99.9%, 99.95%, 99.99%, or 99.999% uptime?  Just an increase in your cost of goods sold.
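Those uptime percentages translate into very different downtime budgets, which is where the cost difference comes from:

```python
MIN_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a (non-leap) year

for sla in (0.999, 0.9995, 0.9999, 0.99999):
    allowed = (1 - sla) * MIN_PER_YEAR
    print(f"{sla:.3%} -> {allowed:7.1f} minutes of downtime allowed per year")
```

Going from 99.9% to 99.999% shrinks the annual budget from roughly 8.8 hours to about 5 minutes, a 100x tightening that the provider ultimately has to pay for.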

The second penalty is reputational risk.  When you breach an SLA it causes harm to your reputation.  If a customer runs for years before having an SLA breach, that breach does little to damage your customer relationship, as long as you don’t breach the SLA again for a long time.  If you breach SLAs frequently, customers learn they can’t trust your service.  They may even seek alternatives.

Customers don’t even care about the financial penalties of an SLA breach.  Those are trivial compared to the cost of the breach to their business.  Meeting the SLA is what they really want; they see the financial penalty as an incentive for you to meet it.  The service provider’s accountants and lawyers will certainly want to make sure the business plan accommodates the SLA breaches, but as long as it does they will accept them.

A service provider willing to absorb a higher financial penalty from SLA breaches, and with a low concern for reputational risk, can set an SLA that they can’t consistently meet. A service provider with great concern for reputational risk will set an SLA they can consistently meet, even if it means that SLA is lower than its competitors.  The former favors the marketing advantage of a high SLA, the latter favors actual customer experience.

Which would you rather have, a service that claims 99.999% availability but only delivers 99.9%, or one that claims 99.99% availability and delivers 99.99%?  The 5 9s SLA sounds great, but it delivers 10x the downtime of the 4 9s SLA!  Do you want an SLA that your service provider almost always meets, or one that sounds, and is, too good to be true?
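The arithmetic behind the “nines” makes the gap obvious.  A quick sketch of the allowed downtime per year at each availability target:

```python
# Allowed downtime per year for a given availability target -- the
# arithmetic behind the "nines".

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):7.1f} min/year")

# 99.9%  -> 525.6 min/year (~8.8 hours)
# 99.99% ->  52.6 min/year (~53 minutes)
# 99.999% ->  5.3 min/year
```

The provider claiming five 9s but delivering three inflicts roughly ten times the downtime of one that claims, and delivers, four 9s.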

Personally I’ll take the consistent SLA, for two reasons.  First, because I can and will design around an SLA I can trust.  But one that is fictional will cause me to make bad decisions.  Second, because the service provider giving me an SLA that will reflect my actual experience is a service provider I can trust.

Bottom line, take SLAs with a large grain of salt.  Particularly when you can’t tell how often the SLA is breached.  More so if a service provider offers an SLA before having gained a significant amount of operational experience.  And if you can get a service provider to tell you how often they breach their SLA, more power to you.

Posted in AWS, Azure, Cloud, Computer and Internet, Google

16TB Cloud Databases (Part 1)

I could claim the purpose of this blog post is to talk about Amazon RDS increasing the storage per database instance to 16TB, and to some extent it is.  It’s also an opportunity to talk about the challenges of a hyperscale environment, not just for AWS but for Microsoft Azure, Google Cloud, and others as well.  I’ll start with the news, and since there is so much ground to cover I’ll break this into multiple parts.

As part of the (pre-)AWS re:Invent 2017 announcements, Amazon RDS launched support that increased the maximum database instance storage size from 6TB to 16TB for PostgreSQL, MySQL, MariaDB, and Oracle.  RDS for Microsoft SQL Server had launched 16TB database instances back in August, but with the usual RDS SQL Server restriction of the storage not being scalable.  That is, you had to pick 16TB at instance create time.  You couldn’t take a 4TB database instance and scale its storage up to 16TB.  Instead you would need to dump and load, or use the Native Backup/Restore functionality, to move databases to a new instance.  If the overall storage increase for RDS was lost in the noise of all the re:Invent announcements, the fact that you can now scale RDS SQL Server database instance storage was truly buried.  The increase to 16TB benefits a relatively small number of databases for a relatively small number of customers; the scalability of SQL Server database instance storage benefits nearly all current and future RDS SQL Server customers.

While RDS instances have been storage limited, Amazon Aurora MySQL has offered 64TB for years (and Aurora PostgreSQL also launched with 64TB support).  That is because Aurora was all about re-inventing database storage for the cloud, so it addressed the problems I’m going to talk about in its base architecture.  Non-Aurora RDS databases, Google’s Cloud SQL, Azure Database for MySQL (or PostgreSQL), and even Azure SQL Database (which, despite multiple name changes over the years, traces its lineage to the CloudDB effort that originated over a decade ago in the SQL Server group) have lived with the decades-old file- and volume-oriented storage architectures of on-premises databases.

Ignoring Aurora, cloud relational database storage sizes have always been significantly limited compared to their on-premises instantiations.  I’ll dive into more detail on that in Part 2, but let’s come up to speed on some history first.

Both Amazon RDS and Microsoft’s Azure SQL Database (then called SQL Azure) publicly debuted in 2009, but had considerably different origins.  Amazon RDS started life as a project by Amazon’s internal DBA/DBE community to capture their learnings and create an internal service that made it easy for Amazon teams to stand up and run highly available databases.  The effort was moved to the fledgling AWS organization, and re-targeted to helping external customers benefit from Amazon’s learnings on running large, highly available databases.  Since MySQL had become the most popular database engine (by unit volume), it was chosen as the first engine supported by the new Amazon Relational Database Service.  RDS initially had a database instance storage size limit of 1TB.  Now I’m not particularly familiar with MySQL usage in 2009, but based on MySQL’s history and capabilities in version 5.1 (the first supported by RDS), I imagine that 1TB covered 99.99% of MySQL usage.  RDS didn’t try to change the application model; indeed, the idea was that the application had no idea it was running against a managed database instance in the cloud.  It targeted lowering costs while increasing the robustness (reliability of backups, reliability of patching, democratization of high availability, etc.) of databases by automating what AWS likes to call the “undifferentiated heavy lifting” aspects of the DBA’s job.

As I mentioned, Azure SQL started life as a project called CloudDB (or Cloud DB).  The SQL Server team, or more precisely remnants of the previous WinFS team, wanted to understand how to operate a database in the cloud.  Keep in mind that Microsoft, outside of MSN, had almost no experience in operations.  They brought to the table the learnings and innovations from SQL Server and WinFS, and decided to take a forward-looking approach.  Dave Campbell and I had spent a lot of effort since the late 90s talking to customers about their web-scale application architectures, and were strong believers that application systems were being partitioned into separate services/microservices with separate databases, and that those databases were then being sharded for additional scalability.  So while in DW/Analytics multi-TB (or, in the Big Data era, PB) databases would be common, most OLTP databases would be measured in GB.  Dave took that belief into the CloudDB effort.  On the technology front, WinFS had shown it was much easier to build on top of SQL Server than to make deep internal changes.  Object-relational mapping (ORM) layers were becoming popular at the time, and Microsoft had done the Entity Framework as an ORM for SQL Server.  Another “research” project in the SQL Server team had been exploring how to charge by units of work rather than traditional licensing.  Putting this all together, the CloudDB effort didn’t go down the path of creating an environment for running existing SQL Server databases in the cloud.  It went down the path of creating a cloud-native database offering for a new generation of database applications.  Unfortunately customers weren’t ready for that, and proved resistant to some of the design decisions (e.g., Entity Framework was initially the only API offered) that Microsoft made.

That background is a little off topic, but hopefully useful.  The piece that is right on topic is Azure SQL storage.  With a philosophy that apps would use lots of modest-sized databases or shards (and understand sharding), charging by the unit of work (which enabled multi-tenancy as a way to reduce costs), routing built above a largely unchanged SQL Server engine, and no support for generic SQL (with its potential for cross-shard requests), Azure SQL launched with a maximum database size of 50GB.  This limit would prove a substantial pain point for customers, and a few years later it was increased to 150GB.  When I asked friends why the limit was still only 150GB they responded with “Backup.  It’s a backup problem.”  And therein lies the topic that will drive the discussion in Part 2.

I’ll close out by saying that relatively low cloud storage size limits are not unique to 2009, or to Amazon RDS and Azure SQL.  Google Cloud SQL Generation 1 (aka their original MySQL offering) was limited to 500GB databases.  The second generation, released this year for MySQL and in preview for PostgreSQL, allows 10TB (depending on machine type).  Azure SQL Database has struggled to increase storage size, but now maxes out at 4TB (depending on tier).  Microsoft’s Azure Database for MySQL and PostgreSQL is limited to 1TB in preview, though they mention it will support more at GA.  RDS has increased its storage size in increments: in 2013 it was increased to 3TB, and in 2015 to 6TB.  It is now 16TB or 64TB depending on engine.  Why?  Part 2 is going to be fun.

Posted in Amazon, AWS, Cloud, Computer and Internet, Database, Microsoft, SQL Server

Amazon Seattle Hiring Slowing?

An article in today’s Seattle Times discusses how Amazon’s open positions in Seattle are down by half from last summer, and at a recent low. I don’t know what is going on, but I will speculate on what could be one of the major factors. Let me start by covering a similar situation at Microsoft about 20 years ago. Microsoft had been in its hyper-growth phase, and teams would go in with headcount requests that were outrageous. Paul Maritz would look at a team’s hiring history and point out that to meet their new request they’d need to hire x people per month, but they’d never hired more than x/2 per month. So he’d give them headcount that equated to x/2+<a little>, and then he’d maintain a buffer in case teams exceeded their headcount allocation. Most teams would fail to meet their headcount goals, a few would exceed them, but Microsoft (at least Paul’s part) would always end up hiring less (usually way less) than the total headcount it had budgeted for. It worked for years, until one year came along where most teams were at or near their hiring goals and a couple of teams hired way over their allocated headcount. Microsoft had over-hired in total, and some teams were ahead of where they might have been even with the following year’s budget allocation. From then on there was pressure on teams to stay within their allocated headcount, both for the year overall and for the ramp-up that was built into the quarterly spending plans.

Could something similar be happening at Amazon? Could this be as simple as Amazon telling teams “no, when we said you can hire X in 2017 we meant X”, and enforcing that by not letting them post 2018 positions until 2018 actually starts? Amazon is always looking for mechanisms to use, rather than counting on good intentions, and having recruiting refuse to open positions that exceed a team’s current fiscal year’s headcount allocation would be a very solid mechanism for enforcing hiring limits.

It will be interesting to see if job postings start to grow again when the new fiscal year starts. That would be the most direct confirmation that this is nothing more than Amazon applying better hiring discipline on teams.

Posted in Amazon, Computer and Internet | 2 Comments