In preparing for an upcoming newsletter, I interviewed Bob Glushko, Adjunct Professor at the University of California at Berkeley in the School of Information, on concepts covered in his Document Engineering book. We've had some other pressing priorities and it doesn't look like we will publish the newsletter on this topic in the near future. But I have the interview and we have this blog. I'll publish this in three parts to make it a bit more manageable to read.
Here is part 1 of the interview:
Ed: We in the traditional publishing world tend to think of a "document" as a textual piece of information (as opposed to fielded data), but you apply a broader definition for the purposes of the document engineering discipline defined in your book. How do you define "document"?
Bob: I define document as “a purposeful and self-contained collection of information.” That’s a technology-neutral definition that doesn’t impose a sharp boundary between unstructured or semi-structured information published for people and structured sets of information primarily used by computers. Documents organize the interactions between people, between enterprises and their customers or clients, or between applications and services. If we think of documents as the input requirements and as the output results from many kinds of processes, the concept of a document seems surprisingly stable and transcends the profound changes that technology has made in document encoding and exchange.
Some people object to classifying small pieces of information as documents. They want to distinguish between fine-grained, structured data and coarse-grained, unstructured documents, or to use the latter term only where they can imagine something printed. But that’s why “self-contained” is part of my definition. A single data element (a price or a quantity, for example) needs some context for its interpretation. That context can be conveyed by additional information in the form of metadata or text that accompanies or contains the data element… and that combination of information makes it a document by my definition.
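To make the "self-contained" point concrete, here is a minimal Python sketch. It is only an illustration, not anything from the book: the class names and the choice of "meaning" and "unit" as the required context are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:
    """A bare value, such as a price, with no context of its own."""
    value: str


@dataclass
class Document:
    """A data element plus the context needed to interpret it."""
    element: DataElement
    context: dict = field(default_factory=dict)

    def is_self_contained(self, required=("meaning", "unit")):
        # Hypothetical rule: the value can be interpreted only if
        # every required piece of context travels with it.
        return all(key in self.context for key in required)


# "19.99" alone could be a price, a weight, or a version number.
bare = Document(DataElement("19.99"))

# The same value with accompanying metadata can stand on its own.
priced = Document(
    DataElement("19.99"),
    {"meaning": "unit price", "unit": "USD"},
)
```

Here `bare.is_self_contained()` is false and `priced.is_self_contained()` is true: the value is identical, and only the accompanying context differs.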
Ed: We (and again I am speaking about the traditional publishing industry) tend to approach content modeling by analyzing the components of a document and then creating a DTD or Schema to represent the model. How does your approach of document engineering differ?
Bob: Document engineering looks abstractly at the continuum that connects documents and data, so it blurs the traditional distinction between “document analysis” and “data modeling” and emphasizes what they have in common rather than what they do differently. Document engineering applied to publishing is a little different from document engineering for transactional documents, but they have a lot more in common than most people realize, so before I talk about what’s different, let me explain what’s the same.
For both publications and transactional documents, we start by studying instances and other information sources to identify and pull apart semantic, structural, and presentational components. Then, for both types of documents, we refine the components to make them “good” ones with more explicit or reusable semantics and structure. We then use these component models as the building blocks for hierarchical document models that enforce the semantic, structural, and presentational rules for some particular context. Finally, we encode the assembled document models in some concrete syntax, most often as an XML Schema or DTD. The semantic and structural rules go into the schema, and the presentational rules go into style sheets or transforms.
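The last step, where semantic and structural rules go one way and presentational rules go another, can be sketched roughly in Python. This is a simplified illustration under stated assumptions: the `Component` fields, the `encode` function, and the sample invoice components are all hypothetical, and plain dictionaries stand in for a real schema and style sheet.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str        # component name, e.g. an element name
    datatype: str    # semantic rule
    required: bool   # structural rule
    style: str       # presentational rule


def encode(components):
    """Split an assembled model into schema rules and style rules."""
    # Semantic and structural rules would end up in the schema.
    schema_rules = {
        c.name: {"datatype": c.datatype, "required": c.required}
        for c in components
    }
    # Presentational rules would end up in a style sheet or transform.
    style_rules = {c.name: c.style for c in components}
    return schema_rules, style_rules


# A made-up document model assembled from two components.
invoice_model = [
    Component("InvoiceDate", "date", True, "right-aligned"),
    Component("Note", "string", False, "italic"),
]

schema_rules, style_rules = encode(invoice_model)
```

The point of the split is that the same schema rules can be paired with different style rules for different outputs without touching the model itself.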
In content modeling for publishing, semantics are usually weak (most content is “text”), so structural and presentational components dominate (and are usually correlated). So it is harder to define a library of reusable content components than it is for more data-intensive types of documents, which tend to have more explicit semantics, strong datatyping, and more arbitrary presentation. And because transactional instances are more homogeneous than publication instances, it is more important to analyze a much greater variety of information sources for transactional models: database tables, spreadsheets, accounting systems, printed or web forms, and descriptions of application program interfaces (including the code that implements them). The contrast between publishing and transactions is also important when the model is encoded, because the stronger constraints for transactions require XML Schema, whereas DTDs may suffice for publication types.
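The difference in constraint strength can be illustrated with a small Python sketch. It does not run a real DTD or XML Schema validator; the two functions are hand-written stand-ins, and the `order` instance and its element names are invented for this example. The first check covers what a DTD content model can express (which child elements appear, and in what order), and the second covers the kind of datatype constraint that XML Schema adds (for instance `xs:integer` and `xs:decimal`).

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

# A made-up transactional instance with a bad quantity value.
ORDER = "<order><quantity>three</quantity><price>19.99</price></order>"


def structure_ok(root):
    """DTD-style check: the required children are present, in order."""
    return [child.tag for child in root] == ["quantity", "price"]


def datatypes_ok(root):
    """Schema-style check: quantity is an integer, price is a decimal."""
    try:
        int(root.findtext("quantity"))
        Decimal(root.findtext("price"))
    except (ValueError, InvalidOperation, TypeError):
        return False
    return True


root = ET.fromstring(ORDER)
structure_result = structure_ok(root)   # structure alone looks fine
datatype_result = datatypes_ok(root)    # "three" is not an integer
```

The instance passes the structural check but fails the datatype check, which is the kind of error a transactional system needs to catch and a publishing workflow usually does not.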
Check back tomorrow for part 2.