People have been bugging me to write about Integrated Storage for some time, and with Bill Gates having just disclosed that failure to ship WinFS was his biggest product regret, now seemed like a good time. In Part 1 I’ll give a little introduction and talk about scenarios and why you’d want an Integrated (also referred to as unified) Store. In a future part (or parts) I’ll talk more about Microsoft’s specific history trying to tackle this problem and what I think the future holds.
To position myself in all this, of the five attempts that Microsoft made at directly attacking this problem I had a hand in three of them, as well as helping with a lot of the ancillary strategy. My last position before leaving Microsoft the first time was as the General Manager of what became known as WinFS, so I have a lot of insight into how it started but only limited second-hand knowledge about how it ended.
I’ve noticed that a lot of people on the periphery have made comments that they never understood what WinFS or, more broadly, Integrated/Unified Storage was about. The common thread is that anyone listening to a description came away with the impression that it was about “search”. Now maybe that is to be expected given the simplest scenarios that people presented. In fact, maybe Bill was most responsible for this.
When trying to express his frustration over the multiple stores situation at Microsoft Bill would use an example of “I know I saw a spreadsheet a couple of weeks ago; when I want to find it again do I look in my file system or do I look in my email?”. Bill was trying to make multiple points with this simple example, but the primary one was not that there should be a way to search across disparate stores. His primary frustration was that spreadsheets were stored in many different places each with their own semantics, APIs, “contracts”, management tools, and user experiences. If you can’t solve the simple problem that Bill expressed of knowing where to look, then how can you hope to solve the problems involved in complex collaborative information worker scenarios or interoperable multi-data type enterprise applications?
So making it easier to find information was a critical goal of any of the integrated storage efforts. By the way, this should be no surprise as the Integrated Storage efforts grew out of the vision for “Information At Your Fingertips”. Nor should it be a surprise that Bill was focused very much on end-user scenarios given the IAYF vision and Microsoft’s background. At the time of the first integrated storage effort, Cairo’s Object File System (OFS), Microsoft had no presence in the enterprise server or apps space. So many scenarios that drove integrated storage were end-user scenarios. Often those were Information Worker scenarios, but sometimes they were Consumer scenarios.
A somewhat simple set of consumer scenarios, and one that was a big focus for WinFS, was around the storage of photos. Let’s say you are on a trip and take a bunch of photos. You take photos at the wedding you attended, and photos of your kids at Disney World, and photos of a launch from Kennedy Space Center, and some pictures late one night at the hot tub that no one but you and your spouse should see. Now you transfer them to your computer and store them in the file system, but how can you organize them? The file system provides very few tools for doing so. They get stored with a meaningless file name, any given photo can be in only one place (and by default just as a collection from that download), and they have a fixed set of attributes that the file system knows about (e.g., creation date). But you want photos that live in multiple places. For example, you might want an album with pictures of Aunt Jean. But you also want the pictures of Aunt Jean at the wedding to be in the wedding album. You also want to share about 50 of the 500 photos you took (and make sure you don’t share any of the hot tub pictures). How do you do that without copying the pictures to a separate share location? Maybe you want to organize photos from all visits to Disney World together, but also keep them together by broader trip.
So integrated storage is about creating a rich organizational system. One that isn’t tied to the rigid structure of file systems but rather to the organizational principles of the domain, application, and/or user preference. Of course you also want to be able to find photos by far richer information than a file system stores in its metadata. Perhaps tagged by the camera it was taken with or the person who actually took the shot. Perhaps you want to query for photos taken within 50 miles of particular GPS coordinates. And so on. Thus search is very important, and so is enabling rich searches based on semantics rather than simple pattern matching.
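To make the photo scenario a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Photo` and `PhotoStore` names, the album names, and the approximate coordinates are mine, and none of it reflects how WinFS or any other real store was built. It shows only the two ideas above: a photo belongs to several albums by reference rather than by copy, and a query can use semantics (distance from a GPS point) rather than file names.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8


@dataclass
class Photo:
    photo_id: str
    lat: float
    lon: float
    tags: set = field(default_factory=set)


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


class PhotoStore:
    """Hypothetical store: each photo is held once, albums hold references."""

    def __init__(self):
        self.photos = {}   # photo id -> Photo
        self.albums = {}   # album name -> set of photo ids

    def add(self, photo, albums=()):
        self.photos[photo.photo_id] = photo
        for name in albums:
            self.albums.setdefault(name, set()).add(photo.photo_id)

    def album(self, name):
        return sorted(self.albums.get(name, set()))

    def near(self, lat, lon, miles):
        """Ids of photos taken within `miles` of the given point."""
        return sorted(
            p.photo_id for p in self.photos.values()
            if miles_between(p.lat, p.lon, lat, lon) <= miles
        )


# Illustrative data; coordinates are rough approximations.
store = PhotoStore()
store.add(Photo("IMG_0001", 28.385, -81.564, {"kids"}),
          albums=["Disney World", "Florida trip"])
store.add(Photo("IMG_0002", 28.573, -80.649, {"launch"}),
          albums=["Florida trip"])
store.add(Photo("IMG_0003", 28.540, -81.380, {"Aunt Jean"}),
          albums=["Wedding", "Aunt Jean", "Florida trip"])

print(store.album("Aunt Jean"))
print(store.near(28.385, -81.564, 50))
```

The same photo shows up in the wedding album, the Aunt Jean album, and the trip album without ever being copied, and sharing a subset is just one more album. Of course, a toy like this lives entirely inside one application, which is exactly the limitation discussed next.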
You can solve many of the problems I described for photos by putting an external metadata layer on top of the file system and using an application or library to interact with the photos instead of interacting directly with the file system. And that is exactly how it is done without integrated storage. This causes problems of its own as applications typically won’t understand the layer and operate just on the filesystem underneath it. That can make functionality that the layer purports to provide unreliable (e.g., when the application changes something about the photo which is not accurately propagated back into the external metadata store). And with photos now stored in a data type-specific layer it is ever more difficult to implement scenarios or applications in which photos are but one data type.
Let me cross over into the enterprise app space and talk about an Insurance Claims Processing scenario. Claims processing is interesting for a number of reasons, the key one being that it was one of the first enterprise applications to really embrace the notion of multiple data types. When you file a claim, for a car accident for example, it goes into a traditional transactional database system. But each claim has an associated set of artifacts such as photos of the accident scene, the police report, photos taken by the insurance adjuster, photos taken at the repair shop, witness statements, etc. that don’t neatly fit into the classic transactional database. Yes, you can store these artifacts in a database BLOB, but then they lose all semantics. Not only that, you have to copy them out of the database into the file system so that applications that only know how to deal with the filesystem (e.g., Photoshop) can work against them. And copy them back. That creates enormous workflow difficulties, introduces data integrity problems, and prevents use of functionality that was embedded in the photos storage application.
The claims processing scenario is one that demonstrates where the name integrated storage came from. What you really want is for the same store that holds your transactional structured data about a claim to hold the non-transactional semi-structured artifacts, and not just as blobs. You want the semi-structured artifacts to expose their metadata and semantics to the application, or applications, built on that store. As soon as you do that the ability to create richer apps, and/or use the data in complex information worker scenarios, climbs dramatically.
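Here is a rough sketch of what "not just as blobs" buys you, using Python's built-in `sqlite3` module purely as a stand-in for an integrated store. The schema, the column names (`kind`, `taken_by`), and the sample rows are all hypothetical. The point is only that once an artifact exposes its metadata next to the transactional claim data, a single query can span both.

```python
import sqlite3


def build_store():
    """Create an in-memory store holding claims and their artifacts together."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE claims (
            claim_id INTEGER PRIMARY KEY,
            policy   TEXT NOT NULL,
            amount   REAL NOT NULL
        );
        CREATE TABLE artifacts (
            artifact_id INTEGER PRIMARY KEY,
            claim_id    INTEGER NOT NULL REFERENCES claims(claim_id),
            kind        TEXT NOT NULL,   -- exposed metadata, not buried in the blob
            taken_by    TEXT,
            content     BLOB
        );
    """)
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?, ?)",
        [(1, "AUTO-1", 4200.0), (2, "AUTO-2", 900.0)],
    )
    conn.executemany(
        "INSERT INTO artifacts (claim_id, kind, taken_by, content) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "accident_photo", "claimant", b"..."),
            (1, "police_report", "police", b"..."),
            (1, "repair_photo", "repair shop", b"..."),
            (2, "accident_photo", "adjuster", b"..."),
        ],
    )
    return conn


def artifact_kinds(conn, claim_id):
    """Distinct artifact kinds attached to one claim."""
    rows = conn.execute(
        "SELECT DISTINCT kind FROM artifacts WHERE claim_id = ? ORDER BY kind",
        (claim_id,),
    )
    return [r[0] for r in rows]


def claims_missing(conn, kind):
    """Claims that have no artifact of the given kind."""
    rows = conn.execute(
        """
        SELECT c.claim_id FROM claims c
        WHERE NOT EXISTS (
            SELECT 1 FROM artifacts a
            WHERE a.claim_id = c.claim_id AND a.kind = ?
        )
        ORDER BY c.claim_id
        """,
        (kind,),
    )
    return [r[0] for r in rows]


conn = build_store()
print(artifact_kinds(conn, 1))
print(claims_missing(conn, "police_report"))
```

A question like "which claims have no police report?" is trivial here and close to impossible when the artifacts are opaque blobs or loose files. A relational table with a couple of metadata columns is of course a far cry from a store that understands the semantics of each data type, so treat this as an illustration of the direction rather than the destination.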
Rather than just using the photos as part of processing a specific claim they now become usable artifacts for risk analysis, fraud analysis, highway planning, or any number of other applications. Data mining applications could run against them seeking patterns that weren’t captured in the transactional data. Indeed all kinds of linkages could be made amongst the photos, police reports, etc. that just aren’t possible from the transactional data alone.
The multi-data type scenarios are huge in the information worker world and we’ve developed numerous application level technologies to deal with them. OLE, for example, allows you to embed one Office data type within another. ODBC started out life as a way to bring structured data into Excel. But these application-layer solutions have significant flaws. They basically use an import model and you generally aren’t looking at the actual data but rather at a snapshot. And you’ve probably discovered times where it was impossible to refresh a document with current information because you didn’t have access to the location where it was stored. Imagine submitting a settlement brief in a legal case to the judge with the numbers being out of date because of the complex series of steps from an ODBC query populating an Excel spreadsheet that is then embedded in a Word document and somewhere along the line something didn’t update. This could be a disaster.
Even organizing data for information worker projects is difficult. Imagine you are building a proposal for a new business. How do you organize and control all the artifacts amongst a set of people working on the project? SharePoint will do this for you, by creating another store on top of underlying stores. Each application must understand how to work with a SharePoint-like document management system (DMS), or the end-user must use a checkin/checkout system to copy artifacts from the DMS into the filesystem and then put them back.
How about another simple task, like setting up a video conference between a few people in your company and a few at a customer? Contact information about your peers is stored in your company’s Exchange Server and the scheduling is done via Outlook, but your customer contacts are stored in a CRM system. Working with the different sets of contacts can be painful, often involving cut and paste rather than seamless operation. And this is a case where the CRM vendors actively work to integrate with Outlook. Imagine you have a CRM system that hasn’t written a specific Outlook extension. Where the names of common data elements aren’t the same. And when they are the same, where the data formats for them differ. Today we largely treat contacts as an MDM problem, with problem being the operative word. For example, I recently noticed that one of the email addresses I have for Microsoft’s Dave Campbell is actually the email address from another of our former DEC colleagues. Another Dave. Some tool mistakenly merged it into my contact record for Campbell.
Finally, let me give a system management scenario. Many systems that need to combine structured (i.e., typical database data) and semi-structured/unstructured data (e.g., a photo or document) do so by having the database contain a pointer (e.g., URI) to the unstructured data. How do you back up and restore this data in a consistent manner? Imagine going to repair an aircraft and having the diagram associated with the area you are working on be out of sync with the database that contains information on the set of changes that have been applied to that specific aircraft. Without a storage system that can be the primary store for structured, semi-structured, and unstructured data types you always have the situation of being unable to manage the collection of data that make up an application as a unit.
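A tiny sketch of the pointer problem, again in Python and again with made-up names: `db_rows` stands in for database records that point at documents by URI and record which revision they expect, and `file_store` stands in for whatever actually sits in the file system after a restore. The URIs and revision numbers are hypothetical.

```python
def out_of_sync(db_rows, file_store):
    """Return (uri, problem) pairs where the database and the files disagree.

    db_rows:    list of dicts with "uri" and the "revision" the database expects
    file_store: dict mapping uri -> revision actually present on disk
    """
    problems = []
    for row in db_rows:
        actual = file_store.get(row["uri"])
        if actual is None:
            problems.append((row["uri"], "missing"))
        elif actual != row["revision"]:
            problems.append((row["uri"], "stale"))
    return problems


# The database was restored from last night's backup,
# the file share from last week's (illustrative data only).
db_rows = [
    {"uri": "file:///diagrams/left-wing.pdf", "revision": 3},
    {"uri": "file:///diagrams/landing-gear.pdf", "revision": 1},
    {"uri": "file:///diagrams/cockpit.pdf", "revision": 2},
]
file_store = {
    "file:///diagrams/left-wing.pdf": 2,      # older than the database expects
    "file:///diagrams/landing-gear.pdf": 1,   # consistent
}

for uri, problem in out_of_sync(db_rows, file_store):
    print(uri, problem)
```

Note that the best an application can do here is detect the mismatch after the fact. With two separately managed stores there is nothing that backs up, restores, or otherwise manages the record and the diagram as one unit.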
So what is Integrated Storage? It is taking the storage concepts necessary to address these kinds of scenarios and moving them from the application layer, where each application addresses them individually, into a storage layer where they are addressed in a common way. It is a storage system that provides rich and flexible organization, sharing, extensibility, discoverability, control, and manageability across the entire spectrum of data types that need to be stored.
At Microsoft Integrated Storage has repeatedly shown up positioned as a new file system (e.g., WinFS), which many see as a pejorative. There are hints of why you’d want to do this at the file system level in many of my scenarios. So I’ll start off Part II by drilling in to why this is, and why it has been the pivot point on which all attempts to create an Integrated Storage system have failed.
And for those who found this section to be too much rambling, I apologize. If I were doing this as a formal paper or presentation I’d go through scenarios first in a more pure form and then get into problems with current solutions. But this is a blog, so you get to live with stream of consciousness and my time constraints on cleaning it up.