DITA Viewed as Domain-Specific Language

Eric Armstrong at Sun Microsystems has written an interesting blog post about the power of domain-specific languages (by which he means programming languages and data modeling languages), citing Ruby and DITA as two current examples: http://blogs.sun.com/coolstuff/entry/domain_specific_languages.

One of his main points is that part of the power of domain-specific languages is that they tend to foster a supporting infrastructure that makes using them significantly less expensive than more general solutions.

This is certainly true for DITA.  If you've ever had the experience of designing and building a non-trivial XML application from scratch, you know how much easier it is to build on DITA, partly because of its inherent technical design, but also because of the amount of support infrastructure that is available that, more or less, just works.

Documents 2.0

(Ok, first, I'm annoyed that everything is suddenly 2.0 - Web 2.0, Business 2.0, blah blah blah. So just had to get that in there. Unfortunately, Documents 2.0 might actually work as a descriptor for what follows, so I fear it could stick.)

This is a long one, so you might want to grab a cup of coffee.

The unifying theme of XML 2005 was the connection between data and documents. This expressed itself in a number of different ways:

MODELING: Dr. Bob Glushko of UC Berkeley's Information School (what used to be the library science school) talked about "Modeling Methods and Artifacts for Crossing the Data/Document Divide." His bottom line: There aren't two clear types of content - data and documents. Instead, there is a continuum of content, and when we're analyzing it we should avoid thinking in terms of whether it will become XML or relational content until the analysis has been completed. This means, among other things, that we need modeling methods that support both relational and hierarchical relationships, and that can be translated into both XML and database schemas. We also need to apply the best practices of document modeling to data modeling, and the rigor of data modeling to document modeling. And we need to distinguish between the normalized form in which information is best captured for storage and the ways in which it might be recombined for human consumption or delivery to other systems. You can read more about Dr. Glushko's ideas and his new book here: http://www.docengineering.com/.
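To make the "one model, two schemas" idea a bit more concrete, here is a minimal Python sketch. It is not taken from Dr. Glushko's talk or book; the author/article model, the table names, and the element names are all hypothetical. It only shows the same normalized content being loaded into relational tables and recombined into a hierarchical XML tree.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical normalized content model: one author has many articles.
AUTHORS = [(1, "A. Writer")]
ARTICLES = [(10, 1, "First piece"), (11, 1, "Second piece")]


def to_relational(conn):
    """Load the model into two tables linked by a foreign key."""
    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE article ("
        "id INTEGER PRIMARY KEY, "
        "author_id INTEGER REFERENCES author(id), "
        "title TEXT)"
    )
    conn.executemany("INSERT INTO author VALUES (?, ?)", AUTHORS)
    conn.executemany("INSERT INTO article VALUES (?, ?, ?)", ARTICLES)


def to_xml():
    """Recombine the same model as a hierarchy: articles nested in authors."""
    root = ET.Element("authors")
    for author_id, name in AUTHORS:
        author = ET.SubElement(root, "author", id=str(author_id))
        ET.SubElement(author, "name").text = name
        for article_id, owner_id, title in ARTICLES:
            if owner_id == author_id:
                ET.SubElement(author, "article", id=str(article_id)).text = title
    return root


conn = sqlite3.connect(":memory:")
to_relational(conn)
print(conn.execute("SELECT COUNT(*) FROM article").fetchone()[0])
print(ET.tostring(to_xml(), encoding="unicode"))
```

The point of the sketch is that neither output is "the" model: the tuples at the top are the analysis, and the tables and the tree are two renderings of it.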

STORAGE: IBM was present in full force at the show, which is a big change. They are promoting Viper, code name for the next version of DB2. It includes a native XML repository along with the traditional relational database, and the query engine has been made "bilingual": It understands both XQuery and SQL. And not only does it understand both, it understands combinations of the two - endless nestings of SQL inside XQuery inside SQL inside XQuery (looks messy, but not bad to read). It also allows for foreign key relationships between XML nodes and relational fields (a truly wonderful thing). Indexes can be created for XPath expressions (no predicates) to improve performance. IBM has produced some hard-hitting marketing literature regarding the superiority of their approach when compared to the options other relational vendors (umm, maybe Oracle and Microsoft?) have taken for XML storage. Presumably if they can make tough statements about the performance hit of those other approaches, it means that Viper itself will be relatively speedy at loading and querying XML. Apparently Viper also has some appealing schema management capabilities, including the ability to modify schemas without reloading content (sounds like a no-brainer, but other approaches can't do this). Viper is in beta right now (you can learn about the beta program here: http://www-306.ibm.com/software/data/db2/udb/viper/) and will be released towards the end of next year, but some partner firms are releasing support for Viper as soon as Q1.
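For readers who want a feel for what "XML nodes joined to relational fields" means in practice, here is a rough stand-in in Python. It is emphatically not Viper or DB2 syntax: it uses SQLite with an XML string stored in a text column, and a hypothetical helper function, `xpath_text`, registered so that a simple path lookup can be called from inside SQL. The tables, names, and numbers are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def xpath_text(doc, path):
    """Return the text of the first node matching a simple path in an XML string."""
    return ET.fromstring(doc).findtext(path)


def totals_by_rep():
    conn = sqlite3.connect(":memory:")
    # Expose the helper to SQL (a stand-in for a bilingual query engine).
    conn.create_function("xpath_text", 2, xpath_text)

    conn.execute("CREATE TABLE salesperson (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, doc TEXT)")
    conn.executemany(
        "INSERT INTO salesperson VALUES (?, ?)", [(1, "Ana"), (2, "Ben")]
    )
    conn.executemany(
        "INSERT INTO report (doc) VALUES (?)",
        [
            ("<report><rep>1</rep><amount>1200</amount></report>",),
            ("<report><rep>1</rep><amount>300</amount></report>",),
            ("<report><rep>2</rep><amount>500</amount></report>",),
        ],
    )

    # The join condition reads a value out of the XML and matches it
    # against a relational key, loosely like a foreign key into a node.
    return conn.execute(
        "SELECT s.name, SUM(CAST(xpath_text(r.doc, 'amount') AS INTEGER)) "
        "FROM report r "
        "JOIN salesperson s ON s.id = CAST(xpath_text(r.doc, 'rep') AS INTEGER) "
        "GROUP BY s.name ORDER BY s.name"
    ).fetchall()


print(totals_by_rep())
```

Unlike this toy, which re-parses each XML string on every call, a native XML store would presumably index the paths, which is where the XPath indexes mentioned above come in.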

EDITING: Microsoft gave a sneak preview of its next generation of Office, due out at the end of 2006. This is the first time since Office 97 that they're changing to a new file format, called "the Microsoft Office Open Format." And boy, is it a change. They are unraveling Office documents by storing their constituent objects (XML content in custom schemas, content in the Office schemas, images, charts, and so on) in independent files, each of which can be accessed, read, written to, and otherwise processed on its own. Relationship files describe how the objects fit together into a document. The objects, relationship files, and a file describing the file types are gathered together into a zip file with an appropriate Office extension (e.g., "docx"). (Apparently the compression significantly reduces file size, very nice.) Each Office application opens its own zip file format directly - most users will never even know that it's a zipped file. (Learn more here: http://blogs.msdn.com/brian_jones/.)
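The package idea can be sketched in a few lines of Python. This is only an illustration of the structure described above (independent parts, a relationship file, a file-types file, all zipped together); the part names and XML vocabularies below are hypothetical and are not the actual Office format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical parts of a document package; the real layout may differ.
PARTS = {
    "types.xml": '<types><type ext="xml" kind="application/xml"/></types>',
    "content/body.xml": "<body><p>Projected revenue</p></body>",
    "data/custom.xml": "<forecast><total>1500</total></forecast>",
    "relationships.xml": (
        "<relationships>"
        '<rel type="body" target="content/body.xml"/>'
        '<rel type="data" target="data/custom.xml"/>'
        "</relationships>"
    ),
}


def build_package():
    """Zip the independent parts into one compressed package, in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in PARTS.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_part(package, rel_type):
    """Use the relationship file to find one part and parse it on its own."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        rels = ET.fromstring(zf.read("relationships.xml"))
        for rel in rels.findall("rel"):
            if rel.get("type") == rel_type:
                return ET.fromstring(zf.read(rel.get("target")))
    raise KeyError(rel_type)


package = build_package()
print(read_part(package, "data").findtext("total"))
```

Note that `read_part` never touches the body of the document to get at the custom data, which is the property that makes the parts independently processable.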

There's a lot that's cool about all this, assuming it behaves as advertised (big assumption, I know), but my main point in this post is that Microsoft clearly states that their purpose is not to create an alternative to native XML editors. Instead, the goal is to create an environment in which business users can more easily search data in back-end systems and include that data in documents in ways that allow the data to be automatically kept up to date. For example, if a sales manager authors a document about her company's projected revenue, she can use an InfoPath form (XML under the covers, of course) that is embedded directly in the document she's working on (assuming a programmer did some work to set it up for her, presumably in a template) to report on the current numbers from all her salespeople, embed the result in her document, and refresh that number at will as the end of the quarter approaches. The data could also be delivered in the form of an XML document that is included in the document's zip package but not displayed directly - it just becomes a little data source living behind the scenes and traveling along with the document even when the system from which it was extracted is not accessible. (This is definitely cool.) From Microsoft's perspective, the goal is productivity gain through gluing data to documents. I'm trying not to get too excited given how disappointing the earlier Microsoft efforts to incorporate XML into Office applications have been, but this is looking really useful. And although they say they're not trying to replace XML editors, this approach could make it worth a publisher's while to re-think their DTDs/schemas - maybe the amount of structure and data integrity that can be achieved in this type of application is good enough for many needs, and the benefit of using Office applications doesn't need to be explained.
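Here is a small Python sketch of the "data source traveling along with the document" behavior, under stated assumptions: the part name `data/sales.xml`, the `fetch` callback, and the helper names are all hypothetical, and nothing here reflects how Office actually implements refresh. The idea is simply that the package is refreshed when the back-end system is reachable and falls back to its cached copy when it is not.

```python
import io
import zipfile

DATA_PART = "data/sales.xml"  # hypothetical name for the hidden data part


def pack(parts):
    """Zip a dict of part name -> text into an in-memory package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in parts.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read(package, name):
    """Read one part of the package as text."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(name).decode("utf-8")


def refresh(package, fetch):
    """Return a package with fresh data, or the original if the source is down."""
    try:
        fresh = fetch()
    except OSError:
        # Back-end system not reachable: keep the cached copy in the package.
        return package
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, zipfile.ZipFile(
        out, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        # Zip archives cannot be edited in place, so copy every part across
        # and swap in the fresh data part as we go.
        for name in src.namelist():
            dst.writestr(name, fresh if name == DATA_PART else src.read(name))
    return out.getvalue()


def offline():
    raise ConnectionError("sales system unavailable")


doc = pack({"body.xml": "<body/>", DATA_PART: "<total>1500</total>"})
doc = refresh(doc, lambda: "<total>1725</total>")
doc = refresh(doc, offline)  # no change, cached numbers survive
print(read(doc, DATA_PART))
```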

I have to say, I feel somewhat personally vindicated by all this discussion, as it's been an area of personal interest for many years. (I wrote a chapter for the XML Handbook in 1999 that talked about a lot of these very same issues, and presented a discussion on the continuum of data and documents at an AIIM conference a few years ago.) Of course, I'm not alone - most of us who have been working with SGML/XML for a long time ran into these issues a long time ago and have been waiting for technical solutions that would help solve them.

Now that the technology is almost here, there are some potentially even tougher challenges:

1 - When you have the choice of storing some content in XML and some in relational form, it means you have to come up with good rationales for what content is stored how. Turns out that's hard sometimes. There's a lot to say on this topic and I'll return to it another day.

2 - Second, this means that the practice of content modeling as described by Dr. Glushko (actually, he calls it "document engineering," which I think is unhelpful, see the next point) needs to mature, and someone will need to "own" the design and maintenance of the relational/XML models because they are too linked to be truly separate anymore. How will that work??? Are the people who control databases ready to inherit responsibility for DTD/schema maintenance? Could the reverse happen? No on both counts - today. 

3 - Finally, we really need to stop using the word "document" for every XML instance we run into. Even Dr. Glushko used it for both stored XML documents - which presumably exist as an XML narrative in some permanent fashion and for some meaningful reason for the given system context - and for assembled documents that are created for more temporary purposes (presentation or delivery) by combining XML with data and other stored object types (like images). When discussing models, storage methods, and editing tools, it can be extremely confusing to use the generic "document" for both kinds of thing. We need more precise terms. Any suggestions?

Rich data buzz

We published our newsletter today on rich data.  Marianne Calilhanna wrote the lead article describing rich data, and I wrote an article about web products that pull in tabular data in some interesting ways. I also interviewed Mike Marchesano of VNU for our "Oh Really" interview series.

Quite frankly, when Marianne and I first came upon the term "rich data" we were not sure what it meant.  And even after discussions with several people who used the term rather freely, we still couldn't quite get our arms around it.  It seemed to describe something we already knew very well, so what was the deal with the buzz around it?  The term originated in B2B media publishing, but it describes something many publishers have done for years, especially those in STM and reference publishing: reusing and mixing content to create new electronic products.

But it's not totally fair to say rich data is just a new term for an old thing.  Because what would you call that old thing?  A "dynamic web site"?  That is a much broader term that implies something different (and feels so 1990s, doesn't it?).  In some ways, the label "rich data" better defines these types of products.  It moves the focus from the dynamic underpinnings of the web technology to the smart reuse, combination, and delivery of content.  Sure, it's a marketing-esque buzzword, but it is a term that focuses on the quality and use of the content more than the technologies that deliver it.

About this Blog

This blog is produced by the consultants and analysts from Really Strategies, a content solutions and services provider.
