Learn XML in 5 Minutes!

Posted by Christopher Hill on Mar 15, 2011 8:24:00 PM

As product manager for an XML-based CMS, often I find myself answering the question "What is XML?" Not necessarily a simple thing to answer in passing - entire books and courses are dedicated to the subject - but in about 5 minutes most content creators can at least understand at a basic level what XML is all about. Here is my 5 minute XML class to give a first pass at answering this question. At the end most people should have a rough idea of what XML is and can start thinking about its potential.
   
Take a look at the following text:
      
Alice's Adventures In Wonderland
by Lewis Carroll
This book tells the story of an English girl, Alice, who drops down a rabbit hole and meets a colorful cast of characters in a fantastical world called Wonderland.
      
Reading it, you probably can infer that the underlined text is a book title. The italicized text appears to be the author of the book (if you dropped the word "by"). Then there is additional text. If you are familiar with the book, you can tell that the text is not part of the book itself, but appears to be a short summary. We know all of this because of certain conventions we have been taught as well as contextual clues provided by the words themselves.
      
Let's look at the same text again:
     
Lewis Carroll
Alice's Adventures In Wonderland
This book tells the story of an English girl, Alice, who drops down a rabbit hole and meets a colorful cast of characters in a fantastical world called Wonderland.

Even though the text looks quite different, we probably could still successfully interpret the author's intent. Now, imagine you have to train a robot to understand these two cases. What rules might you create to "teach" the robot how to decide how to successfully parse these two text examples and understand what parts of the text are titles, author names, and summaries? Even for these two limited cases, the list of rules is going to be quite involved, and will include rules based on formatting, language, and convention. And a robot will have to know that somewhere in the text are the three pieces of content, otherwise it probably has little chance of determining the nature of what it is looking at (Darwin the Jeopardy-playing supercomputer excluded).

Even if all that was true, errors are likely to creep in. Do your rules allow the robot to understand that "by" is really a formatting cue indicating the following text is an author name? What about the meaning of an underline? Bold? Italics? Each example uses the same formatting cues but in different ways. 

When you write a document in a word processor or a desktop publishing program you mostly focus on how that text looks. Imagine after all that careful work making the text look right for your printer, you decide your content should appear on a Web site. Suddenly all that formatting needs to be re-worked to take advantage of the conventions of the Web. Now if you want to deliver to an electronic reader or a mobile phone your formatting may again need to be revisited. 

This is an expensive proposition when you start imagining all of the possible delivery channels that may require formatting changes. For many channels, all that formatting work needs to be scrapped and done again. 

The answer to this is XML: the extensible markup language. A short way to summarize the promise of XML is that it allows authors to indicate what the content is rather than how the content looks. XML does this using tags, essentially text-based labels that indicate what a particular piece of data might be. Here is the same example as XML. Even though you may not know anything about XML, you can make sense of this content.

<book>
<title>Alice's Adventures In Wonderland</title>
<author>Lewis Carroll</author>
<summary>This book tells the story of an English girl, Alice, who drops down a rabbit hole and meets a colorful cast of characters in a fantastical world called Wonderland.</summary>
</book>

Yes, angle brackets seem a little cold, but notice how they are labels making the meaning of the various pieces of text completely unambiguous. Notice how each tag (i.e. <book>) has a match (i.e. </book>). These pairs create a container. Everything appearing between matching tags is considered contained by them.

Now, imagine you want to deliver the content to a printed page, web site or mobile device. It is much easier now to write a set of rules that indicate how each should be formatted.

Rule for <book>:
start a new page
Rule for <title>:
one the first line, output the text as underlined,16 point Helvetica
Rule for <author>:  
start a new line, output the text in italic, 12 point Helvetica
Rule for <summary>: 
double-space, start a new line, output the text in 12 point Helvetica

These rules could be called a stylesheet. In reality, stylesheets are written in a more machine-friendly format that are beyond the scope of a 5 minute XML course, but this example should suffice to give you an idea of what a stylesheet's role in creating presentable XML content is. 

If I had hundreds, thousands or millions of book content items I could use this single stylesheet to output them all and be guaranteed consistency. 

If I decide I want to switch to a new look for my content, maybe using Garamond instead of Helvetica, then I just need to modify the stylesheet. Notice how the original book content exists completely independently of the formatting. And notice that changing the look of something doesn't require any editing of the original content.

Suppose I want to deliver my content to multiple channels. Each channel has its own conventions, limitations, and capabilities. People expect different formatting on a mobile phone than on their desktop computer. A web site typically looks different than the printed page. We used to underline book titles in print - but underline means something else on the web so often a different format is used. Helvetica may not even be an available font for many channels. 

Fortunately, XML makes supporting each unique channel straightforward. Instead of endlessly modifying my content, I simply develop new rules (a new stylesheet, or a variation of an existing stylesheet) for a new channel. The authoring process is unaffected. Existing content does not need to be re-edited to support the new channel. Instead, the new rules are applied to the content generating channel-appropriate output. And as the number of channels expand, XML serves as a content anchor point so that you can adapt content rapidly and in an automated way to the unique requirements of each channel.
XML as an anchor for multichannel publishing

If all of your content were clearly labeled by what it is, it is not nearly as daunting to support new output formats for new opportunities. Instead to a massive conversion of all your formatted content from one format to another, you just have to develop a new set of rules and you are ready to go.

That's XML in 5 minutes. At least as it impacts content creation and delivery. In the coming weeks I'll be posting additional quick lessons on XML-related technologies.

Topics: ebooks, publishing, XML, 5-minute-series

Comment below