Other than a brief comment on Map-Reduce back in 2008 I’ve avoided commenting on the topics of NoSQL or Big Data, for two reasons. First, I really didn’t have much of an opinion, because, second, I had intentionally stayed away from the database arena for several years. The reason I stayed away, by the way, is that I had other interests I wanted to explore, and every time I stick my toe back into the database world I get pulled in 200%. But a friend kept bugging me for an opinion on NoSQL, so I’m going to give it.
Once upon a time there wasn’t a Software Engineering discipline, and few schools taught Computer Science. Programming was largely self-taught and narrowly focused. For example, a physicist might teach themselves Fortran so they could perform some computations for a problem they were working on. As the use of computers expanded in the 60s and 70s a deep division emerged in the ranks of computer programmers, with a small number of “System Programmers” supporting a very large number of “Application Programmers”. System Programmers tended to be deep technologists, perhaps with Computer Science educations, while Application Programmers had modest (self-paced or occupational-style) training and were more subject matter experts (e.g., they understood retailers’ business processes) than computer scientists. The System Programmers would, for example, write libraries of routines that isolated the Application Programmers from the gory details of the underlying systems.

As software technologies advanced (for example, with the introduction of database management and transaction processing systems), the gap between the average Application Programmer’s abilities and the demands of these new software systems grew to the breaking point. And there weren’t enough System Programmers to bridge the gap. One response was that colleges started to focus more on teaching programming and related technologies (e.g., data structures), producing both more Computer Scientists (and later Software Engineers) and more information technologists (e.g., BAs in Management Information Systems). The other was for vendors to subsume the role of the System Programmer and create new generations of software that were more flexible and more easily used directly by Application Programmers. It turned out Ted Codd’s Relational Model could address this gap in the database world, and so in the 1980s the race was on to turn the relational model into practical products.
By the mid-1990s Relational Database Management Systems (RDBMS) had become the predominant enterprise database management system, and by the mid-2000s were dominant in every aspect of computing from mobile phones to the largest data centers. SQL, which had become the standard (but not only) language for formulating database requests, is now part of the technology that even the self-taught hobbyist programmer learns and uses. RDBMS, and SQL, had won the day.
One of the problems with having a standard (e.g., ISO SQL 2008 is the current formal standard, although ISO SQL 1992 remains the most widely implemented), be it formal or philosophical, is that product evolution can be slow. For example, in the case of distributed systems the SQL single-system image philosophy (meaning the application program can’t tell a distributed database from one running on a single system) has kept this technology from advancing much beyond where it was in the early 1990s (in fact it has actually regressed, with fewer vendors devoting much energy to it). Meanwhile business problems have been moving fast, particularly with the explosion of data being generated and captured over the last decade: the so-called Big Data problem.
Big Data (an explosion of data volumes, sometimes needing real-time processing, and often not lending itself to the structuring rules and processes implemented by RDBMS) went from a theoretical discussion to an overwhelming problem in less than a decade. Those who needed to address this problem quickly created their own solutions, such as Google’s Map-Reduce and the Hadoop open source project it inspired (which is now the focus of most Big Data efforts). These solutions became known as NoSQL.
I find a couple of things interesting about NoSQL. The first is that custom, special-purpose, or non-relational efforts are nothing new, but NoSQL is the first to gain real traction. Many Software Engineers have preferred to attempt their own data storage solutions over using a packaged one, but generally they fail to understand just how difficult it is to reproduce the technologies already incorporated in RDBMS. And most non-relational attempts at products usually get superseded by relational systems. For example, Object-Oriented Database (OODB) products saw slow adoption (outside the CAD/CAM world), and the relational guys eventually incorporated object-oriented features, condemning pure OODBs to the dustbin of history. The same has largely happened with dedicated XML databases. (Note this is similar to what happens with Moore’s Law: most times you build special-purpose hardware you find that Moore’s Law allows general-purpose processors to surpass it in performance and cost within a generation, rendering the special-purpose hardware obsolete.) But NoSQL has, so far, defied this trend.
The second interesting thing about NoSQL is that in a world in which people (i.e., programmers) are expensive but computers are not, a people-intensive technology should not be gaining traction. What’s changed? Well, first of all we’ve gone from a world in which most programmers learned from a self-paced programming course to one awash in hundreds of thousands (if not millions) of programmers with Computer Science/Software Engineering (or similar) degrees. And we have a set of companies, including Google, Microsoft, Facebook, etc., that are willing to hire thousands of the best and brightest of these people. So when it comes to solving an immediate, high-value problem the path of least resistance is to throw people at it. And right now a lot of companies have both a Big Data problem that is overwhelming them and the expertise to use NoSQL technologies to address it.
But “regression to the mean” applies to the Big Data problem just as it has to earlier problems. Most organizations can’t afford (or don’t have) the talent to exploit today’s NoSQL technologies, and even those that do will grow tired of the expense. The race is on to make the Big Data problem more tractable for organizations with less expertise and fewer resources than the Googles and Facebooks of the world. Most of those efforts build on NoSQL. Even Michael Stonebraker’s VoltDB, the relational world’s first real shot across the bow of the NoSQL movement, has announced integration with Hadoop. Stonebraker was also the leader of the RDBMS industry’s counter-assault on OODB with his Postgres research and Illustra product (as well, of course, as being one of the original RDBMS pioneers). Will lightning strike twice (or rather, a third time)?
One thing that NoSQL has going for it is that the big relational vendors (Microsoft, Oracle, and IBM) have all adopted Hadoop for their Big Data efforts. No matter what their long-term plans in Big Data might be, they saw the rapid customer uptake of the Hadoop technology and didn’t want to be left behind. Because Hadoop is open source it was easy to jump on board. Not only that, they have a lot of customer needs to meet, and trying to advance transaction processing and data warehousing capabilities to meet those needs is already taxing their ability to evolve their RDBMS products. Having a separate effort around Big Data (e.g., as Microsoft did for the OLAP store that evolved into Analysis Services) is one way to allow quick movement without putting their core product at risk.
The key question on the table is what the long-term approach to the Big Data problem will be: will NoSQL dominate, or be supplanted by another generation of SQL-based products? When I first looked at Map-Reduce back in 2008 my view was that Google had essentially extracted a primitive you’d find inside a distributed RDBMS and exposed it for direct use by programmers. Will Hadoop simply become a processing environment that most people exploit through a SQL RDBMS? Certainly that’s the way the relational world seems to be going (e.g., Microsoft SQL Server and Rainstor as well as VoltDB). Although their offerings are currently basically connectors, I think in the long run relational database vendors will treat Hadoop as an operating system service that they hide under the covers of new capabilities in their core product offerings. Moreover, I think they’ll add many more Big Data features to their core relational products. The combination will make SQL-based rather than NoSQL-based solutions the primary way most organizations attack their Big Data problems.
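To make the “extracted primitive” point concrete, here is a minimal sketch of the map-reduce pattern in Python (my own illustration, not Google’s or Hadoop’s implementation). Counting words is the canonical example, and the result is exactly what a SQL engine computes for `SELECT word, COUNT(*) FROM docs GROUP BY word`; the map/shuffle/reduce steps mirror the scan, repartition, and aggregate phases of a parallel GROUP BY inside a distributed RDBMS.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (analogous to a distributed
    # RDBMS repartitioning rows by the GROUP BY column).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each group's values into one result, here COUNT(*);
    # SUM, MIN, MAX, and other aggregates work the same way.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big plans", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'plans': 1}
```

In a real Hadoop job the map and reduce functions run on many machines and the shuffle moves data across the network, but the programming model exposed to the developer is just these steps.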
Now the real reason I come to this conclusion is not that I’ve written a couple of RDBMS, but all the startups in the Big Data space. There are dozens, perhaps hundreds, of companies trying to make Big Data a more approachable problem and bring it to a larger audience. They are looking to solve many of the same problems that led the industry to abandon navigation-based DBMS for relational DBMS. They are starting to add features from SQL to Hadoop. NoSQL will eventually get to a place where it offers an environment that is comparable to, and yet arbitrarily different from, a SQL RDBMS. At which point, if the RDBMS vendors do their job, one will have to ask: why not simply use a SQL RDBMS to begin with?
If I did the Gartner thing and assigned probabilities to outcomes I’d assign my prediction a probability of 0.6, meaning I’m not all that confident. The momentum behind NoSQL is rather strong, and with the relational vendors helping it along it is quite possible that alternative solutions will gain unassailable positions before RDBMS-centric solutions can catch up. But I am placing my bet nonetheless.