Learn XML Schemas and DTDs in 5 minutes

Posted by Christopher Hill on Mar 22, 2011 12:56:00 PM

In my previous blog post I introduced XML in 5 minutes. As a follow-up, here's another 5-minute lesson on what an XML Schema or DTD is and what it might mean to end users of XML-based systems.
In the previous post we created an XML document to describe a book. Recall that it used tags around the actual content to describe the content.
<book>
     <title>Alice's Adventures in Wonderland</title>
     <author>Lewis Carroll</author>
     <summary>This book tells the story of an English girl, Alice, who drops down a rabbit hole and meets a colorful cast of characters in a fantastical world called Wonderland.</summary>
</book>

We also learned how representing content in this way allows us to dramatically reduce the effort required to support multichannel publishing. It also helps a great deal with automation and moving content between systems or organizations as it eliminates some of the issues of file formatting.
What would happen to our stylesheet if someone decided to use different tags to label their content?
<book>
     <title>Alice's Adventures in Wonderland</title>
     <writer>Lewis Carroll</writer>
     <summary>This book tells the story of an English girl, Alice, who drops down a rabbit hole and meets a colorful cast of characters in a fantastical world called Wonderland.</summary>
</book>

What we called author in one document we called writer in another. This inconsistency might be small now, but if we didn't restrict what people named things in our XML, we would have to support a potentially endless list of tags. In the previous article we wrote rules for how to make our books look good on a page. If we can't predict what tags (labels) people are going to use - such as author - then it becomes nearly impossible to write those rules reliably.
So even though XML helps us get a consistent base format for content, we need more help to get predictability and consistency.
Enter the concept of a DTD or Schema. DTDs and Schemas are ways that systems can impose rules on the XML itself. You can describe what tags can be used, where they can be used, and put restrictions on the content of those tags. There are two different standards for describing these restrictions: Document Type Definitions (DTDs) and XML Schemas. We won't get into the syntax or pros and cons of the two approaches. For our 5-minute lesson we can just assume both are ways to enforce consistent labeling of the tags in our XML documents.
Here, in plain English, is how we might communicate the requirements for our flavor of XML:
  1. Put everything inside a book tag. You can only have one of these.
  2. The first thing you put in a book is a title tag containing the title text. You cannot leave this out.
  3. The second thing you put in a book is an author tag containing the author name. You must have at least one author. If there are more, you can repeatedly add more tagged authors.
  4. After all the tagged authors, you can add a summary tag. This is optional - leave it out if you want. But you can have at most one summary.
This is essentially what a DTD or XML Schema does, although they do this in a language friendlier to computers.
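Without going deep into syntax, here is a rough sketch of how the four rules above could be written as a DTD. It is an illustration rather than a production-ready definition: the rules are embedded at the top of the document itself to keep the example in one file, and the summary text is shortened.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Rule 1: the DOCTYPE names book as the single top-level tag -->
<!DOCTYPE book [
  <!-- Rules 2-4: one title, then one or more authors (+),
       then an optional summary (?) -->
  <!ELEMENT book (title, author+, summary?)>
  <!-- Each of these tags simply contains text -->
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT summary (#PCDATA)>
]>
<book>
  <title>Alice's Adventures in Wonderland</title>
  <author>Lewis Carroll</author>
  <summary>Alice drops down a rabbit hole into Wonderland.</summary>
</book>
```

Under these declarations, a book with two author tags and no summary would also be valid, while a book with no author at all would not.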
DTDs and XML Schemas allow you to specify the rules for the structure of your XML documents.
You can think of XML Schemas or DTDs as a means to create a template that all valid documents must follow.
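Here is roughly how the four rules above might look written as an XML Schema. Treat it as a sketch under simple assumptions: every tag holds plain text, and no namespace is used for the book tags.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Rule 1: book is the only top-level tag declared -->
  <xs:element name="book">
    <xs:complexType>
      <!-- xs:sequence fixes the order of the tags inside book -->
      <xs:sequence>
        <!-- Rule 2: exactly one title (minOccurs and maxOccurs default to 1) -->
        <xs:element name="title" type="xs:string"/>
        <!-- Rule 3: at least one author, with no upper limit -->
        <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
        <!-- Rule 4: summary may be left out, but appears at most once -->
        <xs:element name="summary" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Notice that the schema is itself an XML document, so the same tools that read your content can read the rules.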

These rules can now be applied to the two examples above. The first example follows the rules, so we would say that the first XML document is valid. That means it conforms to the rules. The second document, when tested with the above rules, would be invalid. The presence of tagged content labeled "writer" is not allowed by the rules, and the required author tag is missing.
In the XML world, XML Schemas or DTDs are used in a lot of scenarios, including:
  • XML editors know what is allowed by the rules and prevent writers from making mistakes
  • XML programs test incoming content and indicate when the rules are being broken, preventing formatting errors
  • XML stylesheets can be much more easily written as they only process valid content and don't have to worry about rulebreakers
  • If I want to merge my book content with yours, we can look at the rules and decide what adjustments will need to be made to bring our rules together
  • Industries can agree on the rules for types of content. So we might create a set of rules to represent newspaper articles and adopt it as an industry standard, enabling anyone to easily exchange newspaper articles without having to modify the content.
So when you hear someone rambling on about an XML Schema or DTD, they are really just talking about the rules governing how the particular XML document is to be structured.
That's XML Schemas and DTDs in 5 minutes. In the coming weeks watch the blog for more quick lessons on XML-related technologies.

Topics: publishing, XML, XML Schema, DTD, 5-minute-series
