Bob Glushko Interview, Part 2: The Document/Data Divide
This is part 2 of 3 in my interview with Bob Glushko on his principles of document engineering. Read part 1 here.
Ed: Your approach for document engineering includes a mixing of traditional document and data analysis processes. What can people who live solely in either of these camps learn from the other?
Bob: I don’t agree with the premise that people live solely in one camp or the other. Publications and transactional document types usually coexist and are often closely related, either by structural transformation or by business processes. Think of tax forms and the instructions for filling them out, or product brochures and purchase orders. For information to mean the same thing or flow efficiently from one type of document to another in sets of related documents, there must be common content components that are reused. This implies an analysis and design methodology that can deal with both kinds of document types at the same time in the same way.
Off the soapbox – I’ll assume for the moment that there are people who THINK they live in one camp or the other. They both benefit from applying analysis and modeling techniques from the other camp. I’d tell the document analysts that they should try to apply some of the more formal and rigorous techniques of data modeling – like normalization – because they will become more systematic at identifying repeating or recurring structures, removing redundancies and technology constraints, and creating a more concise and reusable representation of their information components. This would make it more likely that two document analysts would look at the same domain and information sources and end up with interoperable models.
Similarly, I’d tell the data modelers to lighten up a little. At times the more flexible and heuristic approaches used by document analysts, which evolved because the document world isn’t as regular and homogeneous as the data world, are good enough. Lots of data modelers create fully normalized models that are unwieldy in practice because there are too many components. Document analysts pay more attention to how authors and users interact with the content and are less fixated on formal methods, which enables them to do “heuristic normalization” whenever they design the reusable boilerplate components for a set of publications.
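As a rough illustration of the normalization Bob recommends to document analysts, the Python sketch below factors a repeated buyer block out of two related document types into one reusable component. The record layout, the field names, and the `Party` and `normalize` names are hypothetical and chosen only for this example; they are not taken from the interview or from any document engineering tool.

```python
from dataclasses import dataclass

# Hypothetical denormalized records: the same buyer details are repeated
# in a purchase order and in the invoice that follows it.
raw_documents = [
    {"type": "purchase_order", "id": "PO-1",
     "buyer_name": "Acme", "buyer_street": "1 Main St", "buyer_city": "Springfield"},
    {"type": "invoice", "id": "INV-1",
     "buyer_name": "Acme", "buyer_street": "1 Main St", "buyer_city": "Springfield"},
]


@dataclass(frozen=True)
class Party:
    """Reusable content component shared by several document types."""
    name: str
    street: str
    city: str


def normalize(documents):
    """Split documents into shared Party components and slim document records."""
    party_ids = {}   # Party -> generated identifier
    normalized = []
    for doc in documents:
        party = Party(doc["buyer_name"], doc["buyer_street"], doc["buyer_city"])
        # Reuse the identifier if this party was already seen.
        party_id = party_ids.setdefault(party, f"party-{len(party_ids) + 1}")
        normalized.append({"type": doc["type"], "id": doc["id"], "buyer": party_id})
    components = {party_id: party for party, party_id in party_ids.items()}
    return components, normalized


components, normalized = normalize(raw_documents)
```

Both documents end up referring to the same single `Party` component, which is the kind of redundancy removal the paragraph above has in mind. How far to push this is the judgment call Bob raises: a fully normalized model of a real domain can end up with too many components to be practical.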
Ed: Would you modify your document engineering approach for use in the publishing industry, where the content being analyzed actually is the product, not in support of product sales?
Bob: Document engineering IS already designed for use in the publishing industry because it reflects my own experience in SGML electronic publishing and XML-driven B2B commerce. The analysts and consultants designing the information component libraries and document types in both companies were doing most of the same things, and after the Internet bubble burst I “retired” to UC Berkeley to reflect upon, systematize, and teach about that.
Now I’m not saying that document engineering for publishing and document engineering for transactional documents are exactly the same. It depends entirely on the nature of the documents in your domain. Moby Dick and an invoice are very different and can’t be analyzed and modeled in exactly the same way. But is a catalog a publication or a transactional document? In the middle, with these hybrid cases, there just isn’t a clear distinction between document analysis and data modeling – it’s all document engineering.
Ed: So that’s why you've described the data/document divide as a spectrum. How do the types of content that live in the middle of that spectrum pose the toughest modeling challenges?
Bob: One of the unifying ideas in document engineering is viewing both documents and data on a continuum we call the Document Type Spectrum, by analogy with the continuous rainbow formed by the visible light spectrum. It is easy to distinguish highly narrative documents from highly transactional ones, just as it is easy to distinguish red from blue. But it can be difficult to distinguish different shades of a single color.
These difficult distinctions arise in the middle of the Document Type Spectrum where documents contain both narrative and transactional features. This is where we find hybrid documents like catalogs, encyclopedias, and requests for quotes. A catalog is often modeled with strongly-typed metadata wrapped around weakly-typed or mixed content text descriptions. Encyclopedias, RFQs, and similar types of documents can be modeled as the inverse – text wrapped around islands of strongly typed content. These islands might be timelines or charts or tables of various kinds for the former, or detailed specifications for the latter.
The biggest modeling challenge for these hybrid documents is knowing how granular to make the models. We all know that what often looks like unstructured text is really a “mixed content” model that contains emphasized words, glossary terms, references to tables or figures, citations to supporting documents, links to footnotes or endnotes, on and on. We might be able to eliminate the “PCDATA” altogether, or at least create a repertoire of “inline” elements so that these semantic distinctions can be encoded. But there is always a tradeoff between who does the work when a document is created and who gets the benefit when it is used, and this tradeoff is hardest to compute when the users are an unknown mix of people and automated processes.
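To make the "mixed content" point concrete, here is a minimal Python sketch that walks one paragraph and lists its plain text runs and inline elements in document order. It uses the standard library's `xml.etree.ElementTree`. The sample paragraph and the element names `para`, `term`, `xref`, and `emph` are hypothetical, invented for this illustration rather than drawn from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph that looks like plain prose but carries three
# inline semantic distinctions: a glossary term, a table reference, and emphasis.
SAMPLE = (
    "<para>The <term>harpoon</term> is described in "
    "<xref target='table-2'>Table 2</xref> and was "
    "<emph>never</emph> used.</para>"
)


def mixed_content_parts(element):
    """Return (kind, value) pairs, in document order, for one level of mixed content."""
    parts = []
    if element.text:                      # text before the first child element
        parts.append(("text", element.text))
    for child in element:
        # Record the inline element by tag name, with its full text content.
        parts.append((child.tag, "".join(child.itertext())))
        if child.tail:                    # text between this child and the next
            parts.append(("text", child.tail))
    return parts


parts = mixed_content_parts(ET.fromstring(SAMPLE))
for kind, value in parts:
    print(f"{kind:>5}: {value!r}")
```

Each inline element in the listing is work an author had to do when the document was created, and each one is only worth it if some reader or process later uses the distinction. That is the granularity tradeoff described above.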
Check in tomorrow for the final segment of this interview.

