Documents 2.0
(Ok, first, I'm annoyed that everything is suddenly 2.0 - Web 2.0, Business 2.0, blah blah blah. So just had to get that in there. Unfortunately, Documents 2.0 might actually work as a descriptor for what follows, so I fear it could stick.)
This is a long one; you might want to grab a cup of coffee.
The unifying theme of XML 2005 was the connection between data and documents. This expressed itself in a number of different ways:
MODELING: Dr. Bob Glushko of UC Berkeley's Information School (what used to be the library science school) talked about "Modeling Methods and Artifacts for Crossing the Data/Document Divide." His bottom line: There aren't two clear types of content - data and documents. Instead, there is a continuum of content, and when we're analyzing it we should avoid thinking in terms of whether it will become XML or relational content until the analysis has been completed. This means, among other things, that we need modeling methods that support both relational and hierarchical relationships, and that can be translated into both XML and database schemas. We also need to apply the best practices of document modeling to data modeling, and the rigor of data modeling to document modeling. And we need to distinguish between the normalized form in which information is best captured for storage versus the ways in which it might be recombined for human consumption or delivery to other systems. You can read more about Dr. Glushko's ideas and his new book here: http://www.docengineering.com/.
STORAGE: IBM was present in full force at the show, which is a big change. They are promoting Viper, code name for the next version of DB2. It includes a native XML repository along with the traditional relational database, and the query engine has been made "bilingual": It understands both XQuery and SQL. And not only does it understand both, it understands combinations of the two - endless nestings of SQL inside XQuery inside SQL inside XQuery (looks messy, but not bad to read). It also allows for foreign key relationships between XML nodes and relational fields (a truly wonderful thing). Indexes can be created for XPath expressions (no predicates) to improve performance. IBM has produced some hard-hitting marketing literature regarding the superiority of their approach when compared to the options other relational vendors (umm, maybe Oracle and Microsoft?) have taken for XML storage. Presumably if they can make tough statements about the performance hit of those other approaches, it means that Viper itself will be relatively speedy at loading and querying XML. Apparently Viper also has some appealing schema management capabilities, including the ability to modify schemas without reloading content (sounds like a no-brainer, but other approaches can't do this). Viper is in beta right now (you can learn about the beta program here: http://www-306.ibm.com/software/data/db2/udb/viper/) and will be released towards the end of next year, but some partner firms are releasing support for Viper as soon as Q1.
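To make the idea of one query touching both XML and relational content a bit more concrete, here is a rough Python sketch. It is not Viper or DB2 syntax, and it does not use XQuery. It uses SQLite and ElementTree from the standard library as stand-ins, and the table names, element names, and the `customers_who_ordered` helper are all invented for illustration. The `customer` attribute inside each stored XML order plays the role of a key into a relational table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one purely relational table, one table holding XML text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (10, "<order customer='1'><item sku='A7' qty='3'/><item sku='B2' qty='1'/></order>"),
        (11, "<order customer='2'><item sku='A7' qty='5'/></order>"),
    ],
)


def customers_who_ordered(conn, sku):
    """Return customer names whose stored XML order contains the given SKU."""
    names = []
    for (doc,) in conn.execute("SELECT doc FROM orders ORDER BY id"):
        order = ET.fromstring(doc)
        # XML side: ElementTree's limited XPath subset allows attribute predicates.
        if order.find(f"item[@sku='{sku}']") is not None:
            # Relational side: follow the XML attribute to a row in customers.
            row = conn.execute(
                "SELECT name FROM customers WHERE id = ?",
                (int(order.get("customer")),),
            ).fetchone()
            names.append(row[0])
    return names


print(customers_who_ordered(conn, "A7"))
print(customers_who_ordered(conn, "B2"))
```

In this sketch the application code does the stitching between the two worlds by hand. The appeal of a "bilingual" engine as described above is that the database would do that stitching itself, inside a single query.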
EDITING: Microsoft gave a sneak preview of its next generation of Office, due out at the end of 2006. This is the first time since Office 97 that they're changing to a new file format, called "the Microsoft Office Open Format." And boy, is it a change. They are unraveling Office documents by storing their constituent objects (XML content in custom schemas, content in the Office schemas, images, charts, and so on) in independent files, each of which can be accessed, read, written to, and otherwise processed on its own. Relationship files describe how the objects fit together into a document. The objects, relationship files, and a file describing the file types are gathered together into a zip file with an appropriate Office extension (e.g., "docx"). (Apparently the compression significantly reduces file size, very nice.) Each Office application opens its own zip file format directly - most users will never even know that it's a zipped file. (Learn more here: http://blogs.msdn.com/brian_jones/.)
There's a lot that's cool about all this, assuming it behaves as advertised (big assumption, I know), but my main point in this post is that Microsoft clearly states that their purpose is not to create an alternative to native XML editors. Instead, the goal is to create an environment in which business users can more easily search data in back-end systems and include that data in documents in ways that allow the data to be automatically kept up to date. For example, if a sales manager authors a document about her company's projected revenue, she can use an InfoPath form (XML under the covers, of course) that is embedded directly in the document she's working on (assuming a programmer did some work to set it up for her, presumably in a template) to report on the current numbers from all her salespeople, embed the result in her document, and refresh that number at will as the end of the quarter approaches. The data could also be delivered in the form of an XML document that is included in the document's zip package but not displayed directly - it just becomes a little data source living behind the scenes and traveling along with the document even when the system from which it was extracted is not accessible. (This is definitely cool.) From Microsoft's perspective, the goal is productivity gain through gluing data to documents. I'm trying not to get too excited given how disappointing the earlier Microsoft efforts to incorporate XML into Office applications have been, but this is looking really useful. And although they say they're not trying to replace XML editors, this approach could make it worth a publisher's while to re-think their DTDs/schemas - maybe the amount of structure and data integrity that can be achieved in this type of application is good enough for many needs, and the benefit of using Office applications doesn't need to be explained.
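For readers who want to picture the package idea, here is a minimal Python sketch of a zip file holding independent XML parts, a relationship file, and a hidden data source that travels with the document. The part names, the `relationships.xml` vocabulary, and the helper functions are all made up for illustration. This is not the actual Office file layout, which I have only seen described, just the general shape of it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical parts: visible content plus a data source that is not displayed.
PARTS = {
    "content/body.xml": "<body><p>Projected revenue for Q4</p></body>",
    "data/sales.xml": (
        "<sales><rep name='Ann' total='120'/><rep name='Raj' total='95'/></sales>"
    ),
}


def build_package(parts):
    """Write each part as its own zip entry, plus a file listing the parts."""
    rels = ET.Element("relationships")
    for name in parts:
        ET.SubElement(rels, "relationship", target=name)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, xml_text in parts.items():
            package.writestr(name, xml_text)
        package.writestr("relationships.xml", ET.tostring(rels, encoding="unicode"))
    return buffer.getvalue()


def read_part(package_bytes, name):
    """Parse a single part on its own, without reading the other parts."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return ET.fromstring(package.read(name))


def list_targets(package_bytes):
    """Return the part names recorded in the relationship file."""
    rels = read_part(package_bytes, "relationships.xml")
    return [rel.get("target") for rel in rels.findall("relationship")]


package_bytes = build_package(PARTS)

# The embedded data source can be queried even if the original system is offline.
sales = read_part(package_bytes, "data/sales.xml")
total = sum(int(rep.get("total")) for rep in sales.findall("rep"))

print(list_targets(package_bytes))
print(total)
```

The point of the sketch is that any ordinary zip and XML tooling can pull out one part, such as the sales data, without understanding the rest of the document.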
I have to say, I feel somewhat personally vindicated by all this discussion, as it's been an area of personal interest for many years. (I wrote a chapter for the XML Handbook in 1999 that talked about a lot of these very same issues, and presented a discussion on the continuum of data and documents at an AIIM conference a few years ago.) Of course, I'm not alone - most of us who have been working with SGML/XML for a long time ran into these issues a long time ago and have been waiting for technical solutions that would help solve them.
Now that the technology is almost here, there are some potentially even tougher challenges:
1 - When you have the choice of storing some content in XML and some in relational form, it means you have to come up with good rationales for what content is stored how. Turns out that's hard sometimes. There's a lot to say on this topic and I'll return to it another day.
2 - Second, this means that the practice of content modeling as described by Dr. Glushko (actually, he calls it "document engineering," which I think is unhelpful, see the next point) needs to mature, and someone will need to "own" the design and maintenance of the relational/XML models because they are too linked to be truly separate anymore. How will that work??? Are the people who control databases ready to inherit responsibility for DTD/schema maintenance? Could the reverse happen? No on both counts - today.
3 - Finally, we really need to stop using the word "document" for every XML instance we run into. Even Dr. Glushko used it for both stored XML documents - which presumably exist as an XML narrative in some permanent fashion and for some meaningful reason for the given system context - and for assembled documents that are created for more temporary purposes (presentation or delivery) by combining XML with data and other stored object types (like images). When discussing models, storage methods, and editing tools, it can be extremely confusing to use the generic "document" for both kinds of thing. We need more precise terms. Any suggestions?


Great post, interesting to hear what all is going on down in Atlanta.
It's really interesting to hear about what IBM is working on with Viper. It sounds impressive. I've been hoping to get a chance to suss out the real capabilities of SQL Server 2005 in terms of XML and XQuery, especially in relation to document-centric XML as opposed to data-centric (OK, so I guess I need to stop thinking that way, according to your post ;-) Do you know of any good articles, etc. that address those capabilities in SS2005?
Posted by: Mark Kennedy | November 18, 2005 at 11:26 AM
Hi Mark. I have been looking at SQL Server 2005 a great deal over the past month, and have been surprised by its XML capabilities. XQuery is supported in the new version, as are full text search, XML updates and hybrid queries. You can store true XML content in the XML data type and field out the data that you want to access in a relational fashion.
It does still require chunking the data into a relational table, and so the idea of it being a native XML database is a bit of a misnomer.
I have not seen many written articles, but there is some good information on the Microsoft web site here:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql2k5xml.asp
Posted by: Michael Puscar | November 18, 2005 at 05:54 PM
Thanks for saying mostly nice things about my Document Engineering work... and you're right, the word "Document" has to stretch a bit to fit across the Data/Document divide and I'm not thrilled about having to do that. But I don't think there are any better alternatives. Some people talk about "content engineering" which does seem to span the data/document divide, but to me that implies a focus on the instances rather than the models to which the instances conform. In Document Engineering (see docengineering.com, buy the book, and let me know what you think!) Tim McGrath and I primarily are concerned with the development of the component and document models, not of the "content" that goes into them.
Posted by: Bob Glushko | November 20, 2005 at 03:26 PM
Bob - Only nice things were intended! Your point makes perfect sense to me. I'll read the book before making any more comments. Seriously, though, we've ordered a copy and will give you a review here at some point.
Posted by: Lisa Bos | November 20, 2005 at 10:39 PM
You can find more information about the XML and XQuery support [in MS SQL Server] on my weblog (and linked resources from there): http://www.sqljunkies.com/weblog/mrys.
Posted by: Michael Rys | November 26, 2005 at 01:44 AM