WinFS, Integrated/Unified Storage, and Microsoft – Part 3

Although there are several ways to interpret the phrase “integrated storage” (or “unified storage”), one of the most important is that it creates a single store for Unstructured, Semi-Structured, and Structured types of storage.  The differences between these storage types, often seemingly small, are at the core of the technical, engineering, and political challenges involved in creating a new store.  So before diving into the history of Microsoft’s efforts it is valuable to discuss these three types of storage.

Unstructured Storage, the classic storage provided by operating system file systems, is something I’ve already discussed quite a bit in the previous parts of this series, but I want to add more clarity.  File Systems historically treat files as a bag of bits which can only be interpreted by an application.  They concern themselves with making it very fast to open a file, allocate space to it, stream bits to and from the file, and navigate to specific starting points in the file for performing streaming.  They also pay a lot of attention to maintaining the integrity of the storage device on which the file resides, and to providing certain very specific behaviors upon which an application (which might include a DBMS) can build more robust integrity.

The developers of File Systems tend to rebel against changes that violate the basic Unstructured Storage premises.  They want a very restricted fixed set of metadata about a file so they can make File Open very fast.  They don’t want to introduce concepts that require a lot of processing in the path of moving data between a raw storage device and the application (or network stack in the case of TransmitFile).  They don’t want to introduce complexity into kernel mode that risks the overall reliability of the operating system.  And they pay a huge amount of attention to the overall integrity of a “volume” and what happens when you move it between computer systems.

It isn’t that File System developers haven’t responded to pressures for richer file systems; it is that they have done so in very careful and precise ways that mirror their core mission.  At DEC, for example, they introduced Record Management Services (RMS) to add some measure of structure on top of the core file system.  RMS turns a bag of bits into a collection of records of bits.  In the case of keyed access a set of bits within the records could be identified as a key, which was then indexed, allowing retrieval by key.  But once a record was retrieved the application was responsible for interpreting its contents.  Importantly, RMS existed as a layer on top of the core file system, and didn’t run in kernel mode.
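The keyed-access idea can be sketched in a few lines.  This is a toy illustration only, not RMS’s actual on-disk format or API: the fixed record length, the position of the key, and the helper names build_index and read_by_key are all assumptions made for the example.

```python
import io

RECORD_LEN = 16          # assumed fixed record size, in bytes
KEY_SLICE = slice(0, 4)  # assumed: the first 4 bytes of each record are the key


def build_index(f):
    """Scan the file once and map each key to the byte offset of its record."""
    index = {}
    f.seek(0)
    offset = 0
    while True:
        record = f.read(RECORD_LEN)
        if len(record) < RECORD_LEN:
            break
        index[record[KEY_SLICE]] = offset
        offset += RECORD_LEN
    return index


def read_by_key(f, index, key):
    """Return the raw record bytes for a key; interpreting them is the caller's job."""
    f.seek(index[key])
    return f.read(RECORD_LEN)


# An in-memory stand-in for a file holding three space-padded records.
records = [b"0001ALICE", b"0002BOB", b"0003CAROL"]
data = io.BytesIO(b"".join(r.ljust(RECORD_LEN) for r in records))

idx = build_index(data)
record = read_by_key(data, idx, b"0002")
```

Note that the layer only knows where the key sits; everything after the key is still an uninterpreted run of bytes handed back to the application.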

At Microsoft you can see numerous ways that the File System team tried to accommodate greater richness in the file system without perverting the core file system concepts.  For example, the need for making metadata dynamic or adding some of the things that the Semi-Structured Storage world needs was met by adding a secondary stream capability to files.  That is, the traditional concept of a file was that you had a single series of allocation units pointed to by a catalog entry.  NTFS gained the ability to have that catalog entry point to more than one stream of allocation units.  The primary stream represented the file as we normally know it.  An application could attach another stream to the file to hold whatever it wanted.  The file system guys really didn’t care, and they didn’t interpret the stream.  So this was a very natural extension.  They also created File System Filters as a means to allow extensions to the file system without modifying the core file system itself.

From an engineering and political standpoint you can see what might happen when you start discussing replacing something like NTFS with an Integrated Storage solution like WinFS.  How does it impact the boot path of the operating system?  How does it impact the reliability of the operating system?  What happens to scenarios like Web Servers or Network File Servers, which serve up bags of bits using standardized protocols and are evaluated by benchmarks, and against competition, that will neither benefit from nor suffer the cost of a richer file system?  How would the new file system impact minimum system requirements?  Does the namespace cross multiple volumes?  How would that impact the portability of volumes?  All very good questions that need to be addressed.

The natural progression would be to talk about semi-structured storage next, but since it is the youngest of the storage types I’ll first focus on Structured Storage.  While the file system guys have always treated files as a bag of bits, applications need some way of interpreting those bits.  That knowledge can be completely encapsulated in the application itself, or parts of it can be shared.  One of the earliest motivators of the library mechanisms we find in programming languages today was as a way to share the definitions of how to interpret the contents of a file.  COBOL’s Copy statement was a prime example.  Data Dictionaries, and their modern evolution into the Repository, were further evolutions of this concept.  To commercial data processing applications, as opposed to technical/scientific ones, a file was a collection of records, each of which adhered to a specific format.  That format information was shared across any application that desired to process the file.  So you had a customer file with customer records.  Each record was xxx bytes long.  The first two bytes contained an integer Customer ID, the next 30 bytes had a Customer Name, etc.
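The customer record above can be sketched as a shared format definition.  This is a minimal illustration under stated assumptions: the big-endian byte order, the ASCII space-padded name, and the helper names pack_customer and unpack_customer are mine, and only the first two fields are modeled since the rest of the record is not described.

```python
import struct

# Assumed layout: a 2-byte unsigned big-endian Customer ID followed by a
# 30-byte Customer Name field. A real record would continue with more fields.
CUSTOMER_FORMAT = struct.Struct(">H30s")


def pack_customer(customer_id, name):
    """Encode the two known fields into their fixed-width byte layout."""
    return CUSTOMER_FORMAT.pack(customer_id, name.encode("ascii").ljust(30))


def unpack_customer(raw):
    """Decode the fixed-width bytes back into (customer_id, name)."""
    customer_id, name = CUSTOMER_FORMAT.unpack(raw)
    return customer_id, name.decode("ascii").rstrip()


raw = pack_customer(1042, "Acme Widgets")
customer = unpack_customer(raw)
```

The point of sharing CUSTOMER_FORMAT, like a COBOL copybook, is that every program reading the file agrees on where each field starts and ends.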

Pretty soon this evolved to deal with the fact that apps didn’t process one file with one record type.  You had orders, and order line items, and parts, and the bill of materials for those parts, and inventory information, and the customer, and customer contact information, and so on.  You needed to manage and share these as collections.  Then notions of cross-file integrity entered the picture and transactions, logging/recovery, etc. were added.  And there was recognition that apps not only didn’t care about the physical structure of the “files”, but that putting that knowledge in apps made it hard to evolve them.  So separation of logical file and physical file ensued.  And making every app responsible for the integrity of the data led to logical data integrity problems, so the ability to pull some of that responsibility into what is now called a database management system was added.  And application backlogs became a key problem, so there was a push for reporting and query tools that allowed non-programmers to make use of the data collection.  And for high-productivity “4GL” development tools to allow lower-expertise programmers to write applications.  And this all led to the modern concept of a relational database management system.

So when we talk about Structured Storage we are talking about the classic database management concepts.  We’ve replaced Files/Records with Tables/Rows.  Each table has a well known logical structure that each row in the table conforms to.  There are good mechanisms for making tables extensible, such as adding a new data element (column) that is “null” in rows in which no value has been specified.  And a relational database by its nature transforms tables into other tables so we can actually have virtual table definitions (or views) that applications use.  But basically we are talking about groups of things with well known, externally described, structure.
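A small sketch of these ideas, using Python’s built-in sqlite3 module purely as a convenient stand-in for any relational system.  The table, column, and view names are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table with a well known, externally described structure.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme Widgets')")

# Extend the table: rows that never specified the new column read back as NULL.
conn.execute("ALTER TABLE customer ADD COLUMN region TEXT")
conn.execute("INSERT INTO customer (id, name, region) VALUES (2, 'Globex', 'West')")

# A view is a virtual table derived from other tables.
conn.execute(
    "CREATE VIEW western_customer AS "
    "SELECT id, name FROM customer WHERE region = 'West'"
)

rows = conn.execute("SELECT id, name, region FROM customer ORDER BY id").fetchall()
view_rows = conn.execute("SELECT id, name FROM western_customer").fetchall()
```

The first customer was inserted before the region column existed, so its region comes back as None (NULL), while the view exposes only the rows and columns its definition selects.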

Most of the world of commerce we are used to was made possible by the creation and growth of the concept of Structured Storage.  The modern world of Credit Cards and ATMs is 100% predicated on this work.  Amazon.com was in the realm of science fiction in the 1940s.  By the 1970s the conceptual basis for everything you needed to create it was in place.  It took until the 1990s for those concepts to mature sufficiently to let Amazon happen.  For structured storage we had database management system concepts and (hierarchical and network) implementations appear in the 1960s.  Ted Codd described the relational model in 1969, and during the 1970s the System R and Ingres projects explored how to implement his model.  They also defined most of the integrity concepts we take for granted today such as ACID.  But it wasn’t until the late 1980s that relational database management systems, which found their earliest adoption in “decision support”, became suitable for transaction processing.   And it was the 90s by the time they were the preferred solution for high performance transaction processing.

Moreover, it wasn’t until the late 90s that developers in all application areas embraced relational database management systems.  In fact, in the mid-90s most applications that weren’t clearly in the commercial data processing camp preferred to use unstructured storage even when they were storing structured data.  Today we have smartphone applications using SQLite (and other small relational systems) as a primary means of storage.  My how Structured Storage has evolved.

During the commercialization of relational database management systems (RDBMS) in the 1980s it was recognized that not all data you’d want to store in them was actually structured.  During the development of DEC’s Rdb products Jim Starkey invented the concept of a BLOB (Binary Large Object) as a way to store this data, a concept that was embraced by virtually all RDBMS.  The simple idea here was that you could do something like store an employee’s picture in a blob that was logically inside the employee’s row in the Employee table.  Other ideas quickly developed, such as a document management system with the documents stored in blobs.  But blobs were rather weakly implemented and received minimal attention from RDBMS development groups.  This will play an important role in our later exploration.
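The employee-picture case looks roughly like this, again using sqlite3 as a stand-in rather than Rdb.  The table layout and the placeholder bytes are assumptions for the example; no real image is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, picture BLOB)"
)

# Placeholder bytes standing in for an image file; the database never interprets them.
picture_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b"placeholder image data"

conn.execute(
    "INSERT INTO employee (id, name, picture) VALUES (?, ?, ?)",
    (7, "Pat Example", picture_bytes),
)

name, stored = conn.execute(
    "SELECT name, picture FROM employee WHERE id = ?", (7,)
).fetchone()
```

The blob travels with the row, but to the database it is just as opaque as a file is to a file system.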

Meanwhile a third category of storage had emerged, primarily out of the Information Worker environment, called Semi-Structured Storage.  I like to think about this as having two periods of evolution.  In the first, files remained a bag of bits whose internal structure was private to an application but that also carried around a set of public metadata.  In the second, the internal structure was exposed to any application though they might not be able to actually operate on it.  The latter is the world brought about by XML and I’ll discuss that a bit later.

So what are examples of Semi-Structured Storage?  A Microsoft Word document is one.  Forget that today Word documents are stored as XML using the Open XML standard; they used to be a fully proprietary binary format.  But they exposed metadata such as Title, Author, etc. as a group of Properties known as a Property Bag.  In other words, they promoted certain information from their private format to a publicly accessible one.  Email is another example of something in which there is the content of the message and then a set of metadata about the message.  Who sent it, who was it sent to, what is its Read/Unread status, etc.  For something non-IW think about JPEG files.  There is the image and then there is a set of properties about the image.  Things like the camera it was taken with, GPS coordinates, etc.  Applications, including Outlook or the Windows Shell, can make use of these Property Bags without having the ability to interpret the contents of the file itself.

One of the characteristics of a Property Bag is that new properties can be added rather arbitrarily.  A law firm might create a “CaseNumber” property that it requires employees to tag all Word documents with.  Or Nikon could add specific properties about photos taken with their cameras to a JPEG image that neither the standard defines nor that any app other than their own could make sense of.  But it’s not just top level organizations that can define properties, anyone can.  So the PR department can define a property for its documents such as “ApprovedForRelease” with values such as “Draft” or “Pending” or “Approved”.  Or an individual could define a property such as “LookAtLater” for email messages.

The notion of a Property Bag seems easy enough and painless enough to understand, but it clashes with the world of Structured Storage.  How does arbitrary definition of metadata clash with a world in which schema evolution is (mostly) tightly controlled?  Do you add a column to a table every time someone specifies a new property?  If two people create properties with the same name are they the same property?  If a table with thousands of columns, all of which are Null 99.99% of the time, seems unwieldy then what is an alternate storage structure?  And can you make it perform?
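One commonly used alternate structure is a property table with one row per (item, property) pair instead of one column per property.  The sketch below is only an illustration of that general technique, not a description of how OFS, JAWS, or WinFS stored properties; the table names and the namespace column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE document_property (
        doc_id    INTEGER REFERENCES document(id),
        namespace TEXT,   -- who defined the property; keeps same-named properties apart
        name      TEXT,
        value     TEXT,
        PRIMARY KEY (doc_id, namespace, name)
    );
""")

conn.execute("INSERT INTO document VALUES (1, 'Engagement letter')")
conn.execute("INSERT INTO document VALUES (2, 'Press release')")
conn.executemany(
    "INSERT INTO document_property VALUES (?, ?, ?, ?)",
    [
        (1, "lawfirm", "CaseNumber", "2011-0042"),
        (2, "pr", "ApprovedForRelease", "Pending"),
        (2, "personal", "LookAtLater", "yes"),
    ],
)

# Find documents the PR department has marked as pending.
pending = conn.execute(
    "SELECT d.title FROM document d "
    "JOIN document_property p ON p.doc_id = d.id "
    "WHERE p.namespace = 'pr' AND p.name = 'ApprovedForRelease' "
    "AND p.value = 'Pending'"
).fetchall()
```

New properties need no schema change here, but every value is an untyped string and every property lookup becomes a join, which is exactly where the “can you make it perform?” question bites.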

XML didn’t exist until 1998, so when I start talking about Microsoft’s Integrated Storage history it is important to note that it didn’t play a role in the first two major attempts at a solution (OFS and JAWS).  Prior to XML it was assumed that either a file was explicitly a semi-structured storage type (with a Property Bag, stored in a secondary stream for example) or implicitly one because an application-provided content filter (IFilter) could extract the Property Bag from a proprietary bag of bits.  In either case the application controlled the set of properties that were externalized.  With XML though anyone can examine and process the content of the file, making arbitrary structured storage-like queries possible.  The world of semi-structured storage exploded.
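To make the contrast concrete, here is a sketch of a program that knows nothing about the application that wrote a file, yet can still query its content.  The claim document and its element names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document; any program can walk its structure without the
# producing application's help.
claim_xml = """
<claim id="C-100">
  <policyHolder>Pat Example</policyHolder>
  <items>
    <item><description>Windshield</description><amount>350.00</amount></item>
    <item><description>Bumper</description><amount>1200.00</amount></item>
  </items>
</claim>
""".strip()

root = ET.fromstring(claim_xml)

# A structured-storage-like query: which items cost more than 500?
large_items = [
    item.findtext("description")
    for item in root.iter("item")
    if float(item.findtext("amount")) > 500
]

total = sum(float(item.findtext("amount")) for item in root.iter("item"))
```

With a Property Bag the application decides what is visible; here the whole content is open to arbitrary questions the original author never anticipated.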

There are numerous ways one can combine these three views of storage.  BLOBs were an early attempt to address use cases where unstructured storage was needed in an application that was based on structured storage.  My “ah ha” moment around the importance of XML came during a customer visit and involved a favorite (from the earliest days of my career) application, Insurance Claims Processing.

During the waning days of SQL Server 7.0 Adam Bosworth approached me about this new industry effort, XML, that he and his team were driving.  XML as an interchange effort made a lot of sense, but as a database guy I was a skeptic on using it to store data.  So I set up a series of customer visits to early adopters of XML.  One customer was using it in an insurance claims processing app to address an age-old problem.  The claims processing guys were evolving their application extremely rapidly, much more rapidly than the Database Administration department could evolve the corporate schema.  So what they would do is store new artifacts as XML in a BLOB they’d gotten the DBAs to give them and have their apps work on the XML.  As soon as the DBAs formalized the storage for an artifact in the corporate schema they would migrate that part out of the XML.  This way they could move as fast as they wanted to meet business needs, but still be good corporate citizens (and share data corporate-wide) when the rest of the organization was ready.
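The pattern that customer described can be sketched as follows.  This is a guess at the general shape, not their actual application: the claim table, the extras column, the element names, and the promote helper are all hypothetical, and sqlite3 stands in for the corporate database.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# "extras" is the blob the app team uses for anything not yet in the schema.
conn.execute("CREATE TABLE claim (id INTEGER PRIMARY KEY, amount REAL, extras BLOB)")
conn.execute(
    "INSERT INTO claim VALUES (?, ?, ?)",
    (1, 350.0, b"<extras><adjuster>Lee</adjuster><photoCount>3</photoCount></extras>"),
)


def promote(conn, element_name, column_type):
    """Move one element out of the XML extras into a real column.

    The names are interpolated into SQL, so they must come from trusted code.
    """
    conn.execute(f"ALTER TABLE claim ADD COLUMN {element_name} {column_type}")
    for claim_id, extras in conn.execute("SELECT id, extras FROM claim").fetchall():
        root = ET.fromstring(extras)
        node = root.find(element_name)
        if node is None:
            continue
        root.remove(node)
        conn.execute(
            f"UPDATE claim SET {element_name} = ?, extras = ? WHERE id = ?",
            (node.text, ET.tostring(root), claim_id),
        )


# The DBAs have formalized "adjuster"; "photoCount" stays in the XML for now.
promote(conn, "adjuster", "TEXT")
adjuster, extras = conn.execute(
    "SELECT adjuster, extras FROM claim WHERE id = 1"
).fetchone()
```

After the call, adjuster is an ordinary queryable column while photoCount is still carried in the XML, waiting for its turn.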

I returned from that trip convinced we had to add formal support for XML in SQL Server 2000.  So convinced that I encouraged my boss to bring Adam into the SQL organization and combine his efforts with others to create the Webdata org.  And, in a move that caused some consternation with the rest of the Server team, let the Webdata team make changes to the relational server code base.  And so, independent of integrated storage thinking though actually very much in line with it, SQL Server was on its way in semi-structured storage.  Something I’ll return to in Part 4.

The existence of three types of storage, three sets of often conflicting requirements, three (or more) shipping product streams with different schedules, three classes of experts who deeply understood their type of storage but not the other two, and three organizational centers of activity for those types of storage would make trying to create an Integrated Storage solution a continuing challenge.  It actually gets worse though, in that various efforts which weren’t specifically under the storage or integrated storage umbrellas had deep overlap with storage.  Hailstorm is one example.  And it seemed like everyone in Microsoft had their own sync/replication service.  What was different about WinFS is that most of these barriers, including the organization structure, were addressed.  And the failure to deliver an Integrated Storage File System when the conditions were as close to ideal as they’ll ever be is why the concept will probably never be realized.  Meanwhile the world of storage has moved on in interesting ways.

In the next part of this series I’ll go through the actual history of Microsoft’s efforts.  Depending on its length I’ll either wrap up there with thoughts about the future or finish up with a fifth part.

 


6 Responses to WinFS, Integrated/Unified Storage, and Microsoft – Part 3

  1. Bob - former DECie says:

    Thanks for the review of things.
    Didn’t some of IBM’s small to mid-range systems come with a database-as-the-file-system option back in the mid-90’s?

    • halberenson says:

      Yup, the System/38 and its AS/400 follow-on didn’t expose a traditional file system. They were entirely based on database storage. The S/38 was unique in so many other ways too.

  2. Harry says:

    I (like many other people) read this blog in RSS feed. So do not get a chance to come here and comment :) I should probably visit the origin more often. Thanks for taking your time out and writing this. Looking forward to the next posts.

  3. Just wanted to say that I’m following all these posts, Hal :) Thank you very much for putting so much of attention to the details of the Integrated Storage problem. It is an invaluable source of knowledge, and food for thinking.
