Quick thought on XBRL

I was just reading Mark Logic CEO Dave Kellogg's post on XBRL and Microsoft's claim to be the first company to use it.  After taking a look at the filing from Dave's blog link, it struck me that:

- The poor pilot companies are outputting XBRL, presumably by hand!  This is clearly no language to be editing or even converting without software/system support.

- Wouldn't it be great to have a publishing oriented XML CMS like RSuite to help author, edit, and pull required information together! 


Word 2007 add-in for NLM DTD

Something to keep an eye on for STM publishers . . .

Microsoft released an add-in for Word 2007, called the Article Authoring add-in, that assists in creating XML in the NLM DTD.  You can also import an NLM DTD XML document and load it into Word. It seems to be targeted at the authoring point of view; that is, the add-in is designed for the authoring stage, as opposed to the production/editing stage, where more common (and robust) Word-based tools, like Inera eXtyles, live.  I assume it could also fit into the editing stage, which seems a much more likely place for this thing to happen.

But before you jump in too quickly, realize that even Microsoft warns that the add-in is a beta and not production ready.  And Inera has some cautionary words about Word 2007.

Other links:

  • Download the plug-in  
  • A blog from Pablo Fernicola, the Microsoft product manager overseeing the project
  • A video demo of the add-in (from the blog)
  • I'll be bookmarking anything I see on my del.icio.us links

DITA For Publishing: DITA Project Gutenberg Samples

As a side effect of the new DITA2InDesign project, I have started converting more or less random publications from Project Gutenberg into DITA as a way both to provide some non-trivial, non-technical-document samples in DITA and to demonstrate different approaches to using specific DITA features for specific kinds of content.

The source for the samples is in the DITA2InDesign source code repository on SourceForge. The HTML and PDF renderings from the DITA XML source are served from the DITA2InDesign project Web site: DITA Project Gutenberg Samples. These have been rendered using the out-of-the-box DITA Open Toolkit HTML and PDF2 processors (although the PDF2 processor has been customized to use different fonts from the default Arial).

Once the DITA2InDesign process is working, these documents will serve as test cases for that process as well: they are representative, in size and content characteristics, of what modern publications of similar types would be like when managed as DITA-based XML content.

All the Project Gutenberg documents are either in the public domain or were donated by the copyright owners to Project Gutenberg. If anyone reading this post has a publication that they think would be an interesting candidate for DITA representation, and would be willing to donate the source to the DITA2InDesign project for non-commercial use (that is, the donor can retain the copyright and impose any derivative use restrictions they want as long as the material is licensed for viewing and non-commercial use in its DITA form) then I will happily convert the document to DITA. As for the Gutenberg samples, I can't promise an optimal conversion but I can promise a complete and correct conversion. [Note what I'm offering here: essentially free consulting for the price of giving away access rights (but not ownership) to one publication. Of course, this offer is on a first-come, first-served, time-available, while-supplies-last basis.]

Some things that could be done fairly easily with these DITA documents but that are not currently provided for in off-the-shelf tools include:

  • Generating eBooks in various standard and proprietary formats (OEBPS, Sony Reader, Mobipocket, etc.)
  • Generating digital talking books in NIMAS format
  • Generating Web deliverables tailored for mobile delivery
  • Generating a Wiki-style interactive Web site from the DITA source

In addition, this source is all ripe for additional metadata classification. For example, the entries in the Encyclopaedia Britannica sample should all have explicit subject keywords as part of the topics' metadata.

The DITA Project Gutenberg samples have the same unrestricted use licenses as the original data on the Project Gutenberg site, so feel free to use these samples for whatever you want. In particular, these make useful test and demonstration data sets for DITA-aware products.

Enjoy.

Live DITA Application: FASB U.S. GAAP Codification

The work of all accountants doing commercial accounting in the U.S. is governed by the Generally Accepted Accounting Principles (GAAP), created and maintained by the Financial Accounting Standards Board, a member-supported organization mandated by the U.S. Congress.

Historically the GAAP has been created as a mishmash of different documents and supporting interpretation and commentary. There was no single organizing schema or source. In short, it was essentially impossible to determine whether or not you had found everything relevant to a given accounting issue.

To address this problem, the FASB decided to create a new all-encompassing classification taxonomy for the GAAP and codify all existing GAAP standards under this taxonomy. This project has been going on for over four years and has resulted in the Accounting Standards Codification, or ASC. The ASC content is currently undergoing an extended period of public review and is available through the FASB ASC Web site: http://asc.fasb.org/home.

While the ASC taxonomy itself was a major achievement, the codification activity was a daunting editorial process in which all the existing standards content had to be re-authored in a new form that directly reflects the taxonomy. To support this activity the FASB decided to use an XML-based system, which should come as no surprise.

But beyond that, the FASB realized several important things:

  • The GAAP content is highly modular
  • The GAAP content can be organized in many different useful ways depending on how it is being used:
    • By subject
    • By industry
    • By business process
    • By what's of immediate interest to a particular person researching a problem or set of problems.
  • The GAAP content requires rich metadata to enable accurate search and retrieval as well as binding to the new ASC taxonomy
  • Licensees of the content will want the XML source and will want to be able to use it with as little effort and expense as possible
  • The FASB does not have huge budgets for XML application development and implementation yet needs non-trivial systems for authoring and managing the GAAP content through its editorial processes as well as for delivery through the authoritative FASB Web site.

Given the foregoing, the FASB realized that a more traditional XML application, while possible, would not necessarily be optimal and would likely be prohibitively expensive and would not meet the requirements of licensees for ease-of-use of the XML content.

However, a DITA-based application would satisfy all these requirements. David Prather at FASB realized that the GAAP content could be modeled quite handily using DITA with some GAAP-specific specializations.

David worked out a clever way to use DITA maps to manage the organization and packaging of the codified GAAP content and hired me to design and implement the necessary GAAP-specific specializations (as well as do the data conversion from an initial XML format they had used for the initial codification editorial work). The FASB selected Ovitas to implement a new editorial support CMS system as well as the dynamic delivery system used to serve the ASC content through the FASB Web site.

The project went remarkably quickly--we had working DITA specializations defined and in place in a matter of weeks and the models required only minor refinement as the system implementation progressed, mostly stemming from new understandings of the underlying content as the codification editorial process approached completion. The CMS and Web site implementation went equally smoothly (remarkably so in my experience building such systems).

Because we could use the free DITA Open Toolkit to generate HTML sufficient for internal review of the codified content we didn't need to invest any time or money in acquiring or building rendering support just to support internal Q/A of the DITA content, a significant savings. Essentially, it allowed one part-time consultant, me, to do what would in the past have required a team of three or four consultants months of work to implement. By the same token, we were able to use the off-the-shelf DITA support in XML editors like Arbortext Editor and OxygenXML, removing the need to invest in document-type specific editor configurations and customizations, again saving weeks or months of consultant time. I think I spent about two days coming up to speed on how to configure Arbortext Editor to work with specialized DITA document types and about 1/2 day creating the necessary configurations (it's essentially a copy and modify process that I can now do in minutes).

Likewise, the Toolkit means that licensees can do *something* with the ASC content immediately, as well as giving them a solid base from which to develop whatever internal processes they need. Large publishers with existing XML infrastructure can of course apply that, but smaller publishers with little or no XML infrastructure can still take immediate advantage of the ASC XML source.

The ASC content is currently undergoing an extended period of public review and is available through the FASB ASC Web site: http://asc.fasb.org/home. The content is served dynamically from a slightly sanitized version of the DITA source--it is not static HTML pages generated from the DITA source.

The FASB ASC application is a working example of how the unique features of DITA XML applications significantly lower the cost of building this type of system while enabling significant value for the DITA-based content itself.

One interesting side effect of this system is that most, if not all, of the FASB's licensees, which include all the big name publishers and many smaller ones, will end up with both DITA-supporting internal systems as well as internal DITA expertise that can then be quickly and easily applied to any other DITA-based content, regardless of its markup details or subject domain. That seems pretty interesting to me....

DITA Viewed as Domain-Specific Language

Eric Armstrong at Sun Microsystems has written an interesting blog post about the power of domain-specific languages (by which he means programming languages and data modeling languages), citing Ruby and DITA as two current examples: http://blogs.sun.com/coolstuff/entry/domain_specific_languages.

One of his main points is that part of the power of domain-specific languages is that they tend to foster a supporting infrastructure that makes using them significantly less expensive than more general solutions.

This is certainly true for DITA.  If you've ever had the experience of designing and building a non-trivial XML application from scratch, you know how much easier it is to build on DITA, partly because of its inherent technical design, but also because of the amount of support infrastructure that is available that, more or less, just works.

XML2007 Day 1 (publishing track)

Here are my notes from XML2007 in Boston, Day 1. Since I'm responsible for the publishing track scheduling, I'm hanging out in here all conference.

******

Opening Plenary - Does XML Have a Future on the Web? The big takeaway from this panel was that developers in the real world are still being confronted with the fact that a lot of data can be modeled very effectively relationally and as objects, and using XML for such data imposes some unwelcome complexity, especially in terms of how to map the data to structures available in programming languages. JSON provides an easier way for such developers to work with content. On the other hand, some content doesn't fit well into this model, and so XML's complexity provides value well worth the cost. It was fun to see Michael Sperberg-McQueen and Douglas Crockford "discuss" this divide, but most audience members found value in both perspectives, and didn't seem to take seriously Crockford's notion that XML has been outright dangerous, in part because it has been a distraction from the evolution of other web standards like HTML and JavaScript. The third panelist, Michael Day, offered some practical perspective from the viewpoint of a software provider that needs to wrestle with all the ways in which information might be published on the web. As he said, he saw no reason to privilege one format (HTML, XML, JSON) over another. He also commented that CSS could potentially be used for sophisticated print formatting, not only for the web. I'd like to hear more about that.
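
To make the divide concrete, here is a minimal Python sketch (my own illustration, not something shown on the panel; the sample data is made up). Record-like data drops straight into a dictionary, while document-like mixed content arrives as interleaved pieces of text and markup.

```python
import json
import xml.etree.ElementTree as ET

# Record-like data: JSON maps directly onto dicts, lists, strings, numbers.
record = json.loads('{"title": "Sample Play", "acts": 5}')
acts = record["acts"]

# Document-like data: mixed content interleaves text and markup, so the
# text arrives in pieces (.text, the child element, then the child's .tail).
para = ET.fromstring("<p>All the <em>world</em> is a stage.</p>")
pieces = [para.text, para[0].text, para[0].tail]

# itertext() walks the pieces in document order.
full_text = "".join(para.itertext())
```

The paragraph could be forced into name/value pairs, but only by inventing a convention for where the emphasized word sits in the sentence, which is the kind of structure XML already carries.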

******

Eric Severson from Flatirons spoke about practical DITA lessons. (1) It's harder to model and actually develop content for re-use than you might think. Shouldn't be an all-or-nothing approach - re-use where there is a lot of benefit, don't force-fit the re-use in other cases. Reconciliation also doesn't all need to happen at the time of DITA adoption. (2) How to deal with approval process when no longer working with publications - working with topics? Create a map that is specifically designed to facilitate review - includes enough context for review but probably not the same as the publication. (3) Use specialization only when absolutely necessary - tools don't yet have much support for specialization. DITA committee is trying to address some inflexibility in generic task model that means people end up specializing when they might not want to. If need to specialize, specialize from the standard types as much as possible (task is exception because of other issue). Domain specialization is also an area where specialization is very often justified (for keyword typing, for example). (4) Use conref (enables re-use of content objects inside a topic) only in cases where want to create an index or list of things - for true re-use, can become very limiting to authors - hard to write text for so many contexts. Avoid nested topics for similar reasons. Maps and nested maps provide a better way to do this - impose less overhead on the topics themselves. (5) Dynamic content delivery - Important benefit of DITA is the metadata on topics/maps that indicate audience and other information. Rather than build a static publication on a topic, allow users to leverage that metadata in search.

******

Matt Turner from Mark Logic talked about Office Open XML (the XML underneath Office 2007). Quote: Office Open XML is cool because it's XML and you can mess with it. The new Office writes XML natively - no other format living in between the applications and XML. This is not the previous XML formats - this is new. Spec is huge and complicated and hard to read because of need for backwards compatibility and need for performance (one letter elements). Spec defines a zip package of a document's data (XML) and other items (images etc). The individual XML items in the zip are easy to interpret. The Office apps are now OOXML editors - and other (non-XML) applications could be OOXML editors also. Can bind a control in document (like for a form) to an XML instance inside the document or dynamically retrieved. Have successfully generated OOXML from other data sources (showed demo with Shakespeare's plays). Demonstrated structural editing inside Word. The Office ribbon (replacement for toolbars/menus) can be configured (with XML) and customized to provide the kind of editing tools that are desired, including interaction with other server applications. I can't possibly describe Matt's mashup of tech docs and As You Like It, but it was both informative and amusing. Thanks, Matt.
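
The "zip package of XML parts" idea is easy to try for yourself. What follows is a rough Python sketch of my own, not Matt's demo. It builds a stand-in package in memory holding a single part, so it is not a file Word would open: a real .docx also carries [Content_Types].xml, relationship parts, styles, and more. The part name word/document.xml and the WordprocessingML namespace are the ones real packages use; the text content is made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace; elements use the one-letter names
# mentioned in the talk (w:p = paragraph, w:r = run, w:t = text).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Stand-in for a real .docx: a zip holding one XML part.
document_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>OOXML</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", document_xml)

# Reading it back: open the zip, parse the part, join the text runs.
with zipfile.ZipFile(buffer) as package:
    part_names = package.namelist()
    root = ET.fromstring(package.read("word/document.xml"))

paragraph_text = [
    "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
    for p in root.iter(f"{{{W}}}p")
]
```

Pointing the second half at a real .docx file instead of the in-memory buffer should work the same way, since the main document part lives at the same path.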

******

XML Authoring Tools panel with Justsystems (XMetaL), Adobe, Xopus, moderated by our own Mark Jacobson. XMetaL: Sweet spot is for direct creation of XML technical documentation. Vision of company is to be able to use other Justsystems product (xfy) to enable environments in which content from multiple schemas can be mashed together and where documents can include application logic. Adobe: Concept is to enable tools that address cross-media workflows whether that's simple XML docs using the XML tagging features in Creative Suite or whether it's focusing on layout and automating the generation of the XML later. Also want to get to re-purposing and to support for authoring based on rules or scenarios. For future thinking about schema language including RELAX NG. Xopus: Focused purely on XML editing by non-technical users in a web browser. Discussion: Mark pointed out two contexts for discussion - ubiquity of Word and expectations that brings, and the fact that people don't like to edit inside a structure.

******

Bob DuCharme - XHTML 2 for Publishers: New opportunities for storing interoperable content and metadata. 1.0 was about separation of design and content. 1.1 was about modularization rather than features. 2.0 goals: encode more semantics, more device independence, better forms (XForms), less scripting (XML Events). (Note, not a W3C recommendation yet.) The first two of these drive Bob's contention that XHTML 2 can now be used for publisher content - probably not as the primary source, but for more than just as browser format. Obvious example: interchange among organizations. Why consider it: DTDs can be really complicated - overwhelming and intimidating. HTML is familiar and simpler. How does this work? STRUCTURE: XHTML 2 has a <section> element for grouping. The <h> element represents a heading regardless of level. So, have structure, can change levels without re-tagging. BETTER SEMANTICS/STRUCTURE: separator rather than hr. pre can be embedded inside p, which means can do things like present a single paragraph across multiple lines. lists can be embedded inside paragraphs so relationships are clearer. p as img - show image if available (or depending on device), otherwise show the paragraph text. Use of role attribute that points to namespace rather than class attribute (similar to DocBook). (Use of class for semantics can interfere with use of class for stylesheets, plus class is supposed to be nmtoken.) METADATA: Can use RDFa to embed your own metadata. The predicate and value go on elements that represent the subject (or that are contained in the subject object). So, maybe this:

<section><span property="dc:subject" content="recipe"/>...</section>

Can also do this:

<meta about="http://mynamespace" property="dc:subject" content="recipe"/>

Or even put an id on a content object (like a section) and point to it from meta tags.

But, as one attendee pointed out, HTML 5 is on a separate path from XHTML 2, and it isn't at all clear that XHTML 2 will get much support from browser vendors. Regardless, it appears to be a simple, familiar, but reasonably powerful way of sharing documents even if there is never any expectation of viewing them directly in web browsers.
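
For anyone who wants to poke at the two metadata snippets above, here is a deliberately naive Python sketch (mine, not Bob's). It only collects the raw attribute values. A real RDFa processor would also work out the subject when there is no about attribute and expand the dc: prefix to a full URI. The function name property_pairs is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two snippets from the talk, wrapped in one section for parsing.
snippet = (
    "<section>"
    '<span property="dc:subject" content="recipe"/>'
    '<meta about="http://mynamespace" property="dc:subject" content="recipe"/>'
    "</section>"
)

def property_pairs(root):
    """Collect (about, property, content) for every element with @property.

    Naive on purpose: 'about' is None when the subject is implied by
    context, and prefixes such as 'dc:' are left unexpanded.
    """
    return [
        (el.get("about"), el.get("property"), el.get("content"))
        for el in root.iter()
        if "property" in el.attrib
    ]

pairs = property_pairs(ET.fromstring(snippet))
```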

******

Eric Clark of Time and Lee Vetten of McGraw Hill reviewed what's new in PRISM 2.0:

  • Addition of elements to reflect more complicated workflow (sometimes web-first, sometimes print first) – original platform, web channel, killdate, postdate.
  • Support schema as well as DTD
  • Profiles: XML only profile, rdf/XML profile, also XMP profile now (especially for PDF archives)
  • Updated controlled vocabularies
  • Added aggregation type, genre, and presentation rather than just the previous “category”
  • Added roles for creator and contributor
  • Added a bunch of other elements, including some inline elements
  • Eliminated PRISM 1.0 elements that were redundant to Dublin Core elements

Note: 2.0 is not backwards compatible.

Future work: Subcommittee around rights management (tracking and handling of digital assets). Creating a cookbook document that will help implementors understand how to support some standard use cases. Will roll out via webinar in January.

2.0 docs available now but not 100% complete. Final posting expected in December.

******

Jens Erlandsen spoke about a Swedish Dictionary project for the Swedish Academy (the ones that give the Nobel prize). Dictionary is modeled after the OED - massive scale. Been working on the first edition since 1898. Expect to complete in a few more years. 200 million characters, XML = ~600MB. Happy with manual workflow - working with slips of paper that can sort and look at more usefully than if digital. Dictionaries are special - average number of characters per element is about 7 for this dictionary. Dense, highly marked up, high quality content. Lots of element types - hundreds. The editorial rules are complex and unstable. And some rules can't be reflected in XML - homographs, sorting rules, etc. So how to build a schema? What to leave out? Jens' main theses/questions: (1) A schema can't be developed outside the context of how it will be used and what tools will be used. (2) Can one schema support all needs? No - different parts of the process need different schemas. (3) What is needed beyond schemas to capture all rules? Something - what? Jens covered the approach in detail. This was a great illustration of one of the points from the opening session: dictionary content cannot be represented with name/value pairs. Jens also drove home that an authoring schema can't be designed without lots of experimentation to see what it's like to actually use it as an author/editor - eg to allow users to author flat content and add structure later, to re-organize entries, and so on. He mentioned that they used the iLEX tools for their project, which seem to be pretty cool.

******

Great day.

DITA: It should just work

If you haven't heard of the DITA standard (Darwin Information Typing Architecture) you should have. DITA is emerging as one of the most important XML-based standards for documentation and publishing to come along in a long time. In a nutshell, DITA provides a solid, extensible, flexible architecture for creating, managing, and publishing documents (in the "information for consumption by human readers" sense) where the document content is managed as sets of small modules ("topics") that can be quickly and easily recombined into different delivery packages using a simple hyperlinking mechanism ("maps").

The basic idea with DITA is that you author your content as individual topics, where each topic is a (more or less) standalone chunk of information that can then be combined with other topics to produce a complete work. Obvious examples are encyclopedias, online help, and so on, publications where the information naturally organizes into individual modules. However, DITA can be applied to a much wider variety of content types, although it may not be appropriate for all types of content.

A key feature of DITA is the specialization mechanism. The DITA standard as published defines a set of core document types that are useful but very generic. However, DITA's specialization feature lets you define new document types that are formally derived from the base types such that any DITA-aware processor can process documents using the new document types as though they used the base types.

In addition, DITA defines a simple syntactic mechanism (the DITAArchVersion attribute) that allows a processor to reliably determine that a given document is in fact a DITA-based document, regardless of what DOCTYPE or schema it uses. That is, DITA documents are self describing as DITA.

These two aspects of DITA, specialization and self description, are very important because they mean that, regardless of the markup details of any DITA document, a DITA-aware processor can always correctly and reliably apply its base DITA processing to those documents.
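
Here is a minimal Python sketch of what those two checks can look like in practice. It is my own illustration, not RSuite code. It leans on two details not spelled out above: DITAArchVersion lives in the DITA architecture namespace, and specialization is recorded in each element's class attribute, which lists the element's ancestry from most general to most specific. The gaapTopic element type and both function names are made up.

```python
import xml.etree.ElementTree as ET

DITAARCH_NS = "http://dita.oasis-open.org/architecture/2005/"

# In real documents these attributes are normally defaulted by the DTD or
# schema. They are written out here because this parser does not load
# external DTDs. "gaapTopic" is a hypothetical specialization of topic.
doc = ET.fromstring(
    f'<gaapTopic xmlns:ditaarch="{DITAARCH_NS}" '
    'ditaarch:DITAArchVersion="1.1" '
    'class="- topic/topic gaapTopic/gaapTopic " id="t1">'
    '<title class="- topic/title ">Sample</title>'
    "</gaapTopic>"
)

def is_dita(element):
    """Self description: the root element carries DITAArchVersion."""
    return element.get(f"{{{DITAARCH_NS}}}DITAArchVersion") is not None

def base_type(element):
    """Specialization: return the most general module/type pair, or None."""
    tokens = (element.get("class") or "").split()
    # tokens[0] is the "-" or "+" marker; the base type follows it.
    return tokens[1] if len(tokens) > 1 else None
```

A processor that has never seen gaapTopic can call base_type, get topic/topic back, and apply its ordinary topic handling.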

For RSuite, this means that when RSuite is presented with any DITA document or document type that it's never seen before, it should just work: it should be able to load the DTD or schema and configure it automatically and then load the DITA-based documents and apply to them whatever DITA-specific features RSuite provides, such as treating DITA maps as content assemblies, providing DITA-specific searches, and enabling publication and processing of DITA content using DITA-aware tools such as the open-source DITA Open Toolkit.

That is, for DITA, RSuite should just work.

As an integrator myself that's what I want: Not only should I not have to go to any extra effort to get DITA-based stuff into RSuite, I should be able to expend less effort because I've used DITA.

As a user I should be able to just bring DITA-based stuff into RSuite and have it work with a minimum of effort.

As an engineer building functionality for RSuite I want to provide the smoothest, easiest, most productive user experience I can and DITA allows me to do it.

This ability of DITA to enable this level of convenience and automation has led me to champion this manifesto:

When it comes to DITA, it should just work.

This is my manifesto as a user and integrator of DITA-aware tools: I expect them all to just work, taking advantage of DITA's self descriptive nature to automatically apply specialization-aware processing to DITA content and document types, whatever that means for a specific tool (e.g., automatically using default DITA style sheets and editor customizations in an editor).

This is my manifesto as a provider of DITA-aware tools to you: it should just work. If, as a tool provider, I claim DITA support and it doesn't just work, then I have failed as an engineer.

RSuite's official product plan includes significant support for DITA in the near future. I have started developing a technology demonstration that shows how that support can and should work. If you are interested in having DITA support in RSuite we would be happy to demonstrate what we have working today and talk about what we see as key features and what you see as key requirements for DITA support.

What I have working today is:

  • The ability to take any valid DITA document type, specialized in any way (as long as it conforms to the DITA 1.1 architecture specification), and load it into RSuite as a one-step process such that the DTD is automatically configured for use within RSuite (in particular, all the appropriate default managed objects are defined and configured based on their base DITA types).
  • The ability to take any valid DITA map and all of the topics it links to, specialized in any way, and import them into RSuite as a one-step process, resulting in a new RSuite Content Assembly that reflects the original DITA map completely (including markup details such as the specific element types used on the map and topicref elements).
  • The ability to export any Content Assembly as a DITA map that can then be processed by any normal DITA processor (e.g., the DITA Open Toolkit). If the Content Assembly was created by importing a DITA map, the original markup details will be reflected on export (e.g., if you imported a DITA bookmap map you'll get back a bookmap map on export).
  • The ability, using RSuite's generic content assembly manipulation features, to modify an existing map or create a new map.
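
To give a flavor of why map import can be generic, here is a small Python sketch. It is not the RSuite implementation, just an illustration of the idea, and the sample map, file names, and function names are invented. It walks a map by class attribute rather than by element name, so chapter and appendix are handled as the topicrefs they are, while the original element names are kept for a faithful export.

```python
import xml.etree.ElementTree as ET

# A made-up bookmap. The class attributes are written out because this
# parser does not load the DTD that would normally default them.
bookmap = ET.fromstring(
    '<bookmap class="- map/map bookmap/bookmap ">'
    '<chapter class="- map/topicref bookmap/chapter " href="intro.dita">'
    '<topicref class="- map/topicref " href="history.dita"/>'
    "</chapter>"
    '<appendix class="- map/topicref bookmap/appendix " href="tables.dita"/>'
    "</bookmap>"
)

def is_topicref(element):
    """True for topicref and anything specialized from it."""
    return " map/topicref " in (element.get("class") or "")

def outline(element, depth=0):
    """Return (depth, element name, href) rows in document order."""
    rows = []
    for child in element:
        if is_topicref(child):
            rows.append((depth, child.tag, child.get("href")))
            rows.extend(outline(child, depth + 1))
    return rows

rows = outline(bookmap)
```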

Even in its current crude, demonstration-purposes-only state, this is pretty significant functionality, functionality that a lot of existing DITA-supporting CMS systems cannot or do not provide.

Obviously we at Really Strategies have a lot of work to do to translate this technology demonstration into production-ready software, but there's no particular technical barrier to doing so; it's just workaday engineering to account for all the details. A lot of the work is user interface work--the underlying functionality of RSuite largely does what we need for DITA or can be quickly extended (for example, by providing some additional XQuery DITA-support convenience functions). A lot of the work will be integrating existing and enabling potential back-end processors with RSuite's generic Workflow system so you'll be able to quickly and easily build workflows that send your DITA-based publications out through different DITA-aware processors, including the DITA Open Toolkit, MarkLogic as a DITA-aware dynamic delivery system, and so on.

This project is personally exciting for me: I've been working with DITA for a long time and have built up a largely unrequited set of desires for the functionality a DITA-aware system should provide. Now that I have the opportunity to finally requite these desires my head is literally buzzing with ideas for features and functions that RSuite could provide to make working with DITA-based content as smooth, easy and productive as it can possibly be.

In short, my personal goal is to make RSuite be the DITA supporting system that I've always wanted to have. And you can be sure that if it satisfies me it will almost certainly satisfy you. And if it doesn't, I want you to call me on the phone and tell me how it doesn't so I can get things fixed ASAP.

Stay tuned....

PRISM 2.0 open for public comments

The Publishing Requirements for Industry Standard Metadata (PRISM) working group has released the PRISM 2.0 specification for public comment.

From the press release,

This major revision of PRISM addresses the new requirements for publishers and media companies to deliver content in an online multimedia environment, as well as in print.  According to Lee Vetten, McGraw-Hill Business Information Group’s Co-Chair of the PRISM Working Group, “PRISM 2.0 heralds a new generation for PRISM. Today’s magazine publishers have made a dramatic shift to delivering eMedia-based content online as well as traditional print magazines. The development of PRISM 2.0 reflects the commitment of the PRISM Working Group to mirror today’s new publishing models in the specification.”

Dianne Kennedy, IDEAlliance Vice President of Publishing Technologies, comments, "Based on a series of focus groups conducted during 2006, we have undertaken an aggressive update of the PRISM Specification to address content that, for the first time, appears online before it is cast in print."

Visit the PRISM web site for more information, to sign up for a webinar, and to download a copy of the 2.0 specification.

Get XML from InDesign & InCopy today

There has been a lot of activity in the XML-export-from-Adobe-area here at Really Strategies over the last year.  Over this time, we have developed a system with continually deepening capabilities for extracting XML from Adobe InDesign and InCopy.  The projects have been across several publishing verticals and have produced content to meet several XML standards, for text newsletters, and for the web.  The projects that I have worked on all manage content in the K4 Publishing System, and have used the K4 XML-Exporter aided by Adobe API scripting.

Here are a few thoughts that come out of this effort:

  1. There is no significant technical barrier to getting XML from Adobe InDesign & InCopy today.  However, it does require clients to put away the notion of the magic export button. 
  2. The major limiting factor is the practical limits of standardization required of print production staffs (in conjunction with support by web or text processing production staffs).  However, this limit is continually being reduced through our ongoing innovation efforts.
  3. These projects inherently require process change.  Where text is already being manually processed after files go to printer, this is not very hard to achieve.  It may be more difficult for clients to envision and implement changes where no post processing currently exists. 
  4. Because Adobe doesn't inherently require file usage standardization in InDesign and InCopy, standardization must be built into the use of the tools themselves - meaning print staff are required to modify their use of the tools - though minimally in most cases.   
  5. Projects may also require some semi-automated application of metadata and linking information prior to triggering an export.
  6. The K4 XML-Exporter (an add-on to the K4 Publishing System) is a great tool for this process, primarily because it adds some workflow automation as well as automated access to a good XSL processor.
  7. JavaScript is now a standard element in these projects, as it can powerfully minimize any necessary changes for the print staff.
  8. Integration with Web CMSes, by the way, is an achievable reality. 

We're excited as we continually press forward - it's a great time when there is innovation every day.

Overview of namespaces and W3C XML Schema

Another summary I wrote for other purposes because I couldn't find it on the Web. This stuff is hard to write about. Feel free to correct me if you see something off (in fact, please do).

***

INTRODUCTION

Namespaces allow you to combine elements from multiple domains (i.e., namespaces) in the same XML instance. Even if the instances you encounter don't combine multiple namespaces, you still have to understand namespaces to work with XML that uses W3C XML Schema.

First, some terminology.

    <MYNS:document xmlns:MYNS="http://www.mydomain.com">

  1. "http://www.mydomain.com" is the namespace.
  2. "MYNS" is the namespace prefix.
  3. xmlns:MYNS="http://www.mydomain.com" is the namespace declaration (an attribute and its declared value).

All XML elements belong to a namespace domain. Often you can guess what this is by looking at the element's prefix, but the prefix is only a stand-in: the namespace is whatever the prefix is declared to map to. Elements without a prefix are part of the default namespace that is in scope. You can't guess what the default namespace for an instance is - you have to look for the default namespace declaration. If there is no default namespace declaration, then there is a namespace - it's just null (the Namespaces in XML spec describes such elements as being in no namespace). (Null is not the same thing as not existing.)

When you set the default namespace, it applies to the current element and all its descendants. However, you can switch the default namespace on a descendant node by resetting the xmlns attribute to a new value. I've never seen this done in the real world.
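
Here is a minimal sketch of that scoping rule, using Python's standard xml.etree.ElementTree module. The instance, the <note> element, and the second namespace (http://www.yourdomain.com) are made up for illustration. ElementTree reports every tag as {namespace}localname, which makes it easy to see where the default switches.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the default namespace is reset on <note>.
XML = """
<document xmlns="http://www.mydomain.com">
  <para>Some text</para>
  <note xmlns="http://www.yourdomain.com">
    <para>Other text</para>
  </note>
</document>
"""

root = ET.fromstring(XML.strip())

# Walk the tree in document order and collect the expanded tag names.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

The two <para> elements come out with different namespaces, even though they look identical in the source.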

If you aren't aware that the namespace can be defaulted, writing namespace-aware software (all schema-aware software is namespace-aware) causes confusing problems. For example, if an instance has the default namespace of "http://mydomain.com", then you need to make sure your code's references to its elements also specify that namespace. In XSLT, this means assigning a namespace prefix to the domain, and referencing the element with the prefix included. So, your code might reference <MYNS:document> even though the instance contains <document>.
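
The same trap shows up outside XSLT. As a rough illustration (not the only way to do it), this Python sketch uses the standard xml.etree.ElementTree module against an instance with a defaulted namespace. The MYNS prefix in the lookup is my own choice and does not appear anywhere in the instance.

```python
import xml.etree.ElementTree as ET

# The instance defaults its namespace, so its elements carry no prefix.
XML = '<document xmlns="http://www.mydomain.com"><para>Some text</para></document>'
root = ET.fromstring(XML)

# A bare name is looked up in the null namespace, so nothing matches.
unqualified = root.find("para")

# Binding our own prefix to the same namespace makes the lookup work.
NS = {"MYNS": "http://www.mydomain.com"}
qualified = root.find("MYNS:para", NS)

print(unqualified)
print(qualified.text)
```

The first lookup finds nothing; the second finds the paragraph, because what matters is the namespace, not whether the instance spelled out a prefix.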

The same prefix could be used in two different instances but for two different domains. Again, this can cause issues for software processing if you're not careful. If two different instances contain <MYNS:document> elements, but in one instance MYNS is mapped to http://www.mydomain.com and in the other to http://www.yourdomain.com, then those elements are NOT the same element, and software must be written accordingly. In XSLT, for example, you would declare both the namespaces, but use two different prefixes to reference the domains used in the two instances. So, your software might reference <YOURNS:document> in order to manipulate elements that look like <MYNS:document> in one of the instances.
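
To make that concrete, here is a small Python sketch (standard xml.etree.ElementTree again; the two one-line instances are made up). Both instances use the MYNS prefix, but the parser compares the namespaces behind the prefix, not the prefix itself.

```python
import xml.etree.ElementTree as ET

# Same prefix, same local name, two different namespaces.
FIRST = '<MYNS:document xmlns:MYNS="http://www.mydomain.com"/>'
SECOND = '<MYNS:document xmlns:MYNS="http://www.yourdomain.com"/>'

first = ET.fromstring(FIRST)
second = ET.fromstring(SECOND)

# The expanded names differ, so software must treat these as different elements.
same_element = first.tag == second.tag

print(first.tag)
print(second.tag)
print(same_element)
```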

DECLARING NAMESPACES

Namespaces are declared through attributes with the xmlns prefix. (Note that W3C XML Schemas and XSLT scripts are XML instances, so this applies to them as well.)

Most of the time, you will encounter namespace declarations like this:

    <MYNS:document xmlns:MYNS="http://www.mydomain.com">

        <MYNS:para>Some text</MYNS:para>

    </MYNS:document>

The prefix you want to use follows "xmlns:" in the attribute name, and the attribute's value is the namespace domain it is assigned to. The namespace declaration is included in the instance either on the current element or one of its ancestors.

To set the namespace default, you do the same thing, but the prefix you are assigning is the null one (xmlns is a prefix to nothing!):

    <document xmlns="http://www.mydomain.com">

         <para>Some text</para>

    </document>

In this instance, any element without a prefix is part of the http://www.mydomain.com namespace.

A namespace prefix only needs to be declared once in an instance that uses it, as long as the declaration is on the element that uses the prefix or on one of its ancestors. If a prefix is used with no declaration in scope, the instance is not namespace-well-formed, and a namespace-aware parser will reject it.
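
A quick way to check this rule is to hand a parser a few variations and see which ones it accepts. This is only a sketch: parses() is a helper I made up, and the three instances are invented. It relies on Python's standard xml.etree.ElementTree, whose parser is namespace-aware and raises ParseError on a prefix with no declaration in scope.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the namespace-aware parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Declared once on an ancestor: fine for every descendant.
declared_on_ancestor = (
    '<MYNS:document xmlns:MYNS="http://www.mydomain.com">'
    "<MYNS:para>Some text</MYNS:para>"
    "</MYNS:document>"
)

# Prefix used with no declaration anywhere.
never_declared = "<MYNS:document><MYNS:para>Some text</MYNS:para></MYNS:document>"

# Declared on a sibling, which is neither the current element nor an ancestor.
declared_on_sibling = (
    "<document>"
    '<other xmlns:MYNS="http://www.mydomain.com"/>'
    "<MYNS:para>Some text</MYNS:para>"
    "</document>"
)

for name, text in [
    ("declared_on_ancestor", declared_on_ancestor),
    ("never_declared", never_declared),
    ("declared_on_sibling", declared_on_sibling),
]:
    print(name, parses(text))
```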

ATTRIBUTES AND NAMESPACES

Attributes can have a namespace prefix just like elements can. But attributes don’t follow the same rules as elements regarding namespace declarations. The default namespace does not apply to attributes: if you want an attribute to be in a namespace, you must declare a prefix and use it on the attribute, even when that namespace is already the default. For example, in this:

    <myElement xmlns="http://www.mydomain.com" myAttribute="1234" />

myElement is part of the mydomain namespace, but myAttribute has the null namespace. To make myAttribute part of mydomain, you would need to do this instead:

    <myElement xmlns="http://www.mydomain.com"
      xmlns:MYNS="http://www.mydomain.com"
      MYNS:myAttribute="1234" />

This makes sense if you think it through (couldn't really work any other way).
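
If you want to see the difference rather than take my word for it, this sketch parses both versions with Python's standard xml.etree.ElementTree and prints the attribute names the parser actually stores. The variable names are mine; the two instances are the ones from the examples above.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring(
    '<myElement xmlns="http://www.mydomain.com" myAttribute="1234" />'
)
prefixed = ET.fromstring(
    '<myElement xmlns="http://www.mydomain.com" '
    'xmlns:MYNS="http://www.mydomain.com" '
    'MYNS:myAttribute="1234" />'
)

# The element picks up the default namespace in both cases.
print(plain.tag)
print(prefixed.tag)

# The attribute only gets a namespace when it carries a prefix.
plain_attrs = list(plain.attrib)
prefixed_attrs = list(prefixed.attrib)
print(plain_attrs)
print(prefixed_attrs)
```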

REFERENCING SCHEMAS

There are two ways to reference schemas. Both use an attribute to point to the schema location. Both attributes belong to a reserved W3C namespace domain (http://www.w3.org/2001/XMLSchema-instance), which is almost always referenced with the "xsi" prefix.

1. THE xsi:schemaLocation ATTRIBUTE

Schemas can (optionally) include what is called a target namespace. It is declared using the targetNamespace attribute on the root <schema> element. This is the namespace for elements defined in the schema. If an XML instance references a schema with a declared target namespace, then its elements must belong to the same namespace.

When a target namespace is declared in the schema, instances use the schemaLocation attribute to reference the schema, and the instance's elements must be in the schema target namespace. The simplest way to do that is to set the default namespace for the instance to the schema target namespace. The value of schemaLocation is a pair: the namespace, then whitespace, then the schema location.

    <document xmlns='http://www.mydomain.com'
     xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
     xsi:schemaLocation='http://www.mydomain.com http://www.domain.com/schema.xsd'> ...
    </document>

The default namespace declaration (the xmlns attribute) points to the schema target namespace.

The schemaLocation attribute pairs that namespace with the schema location.

2. THE noNamespaceSchemaLocation ATTRIBUTE

If a schema doesn't include a declared target namespace, then it belongs to the null namespace. Instances using the schema must also keep the default namespace as the null namespace (that is, declare no default namespace), and should use the noNamespaceSchemaLocation attribute to point to the schema.

    <document xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
     xsi:noNamespaceSchemaLocation='http://www.domain.com/schema.xsd'> ...
    </document>

No default namespace is declared (and so it is null), and the noNamespaceSchemaLocation attribute points to the schema location.
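
For software that needs to find the schema an instance points at, here is a rough Python sketch that reads both attributes from the root element. It only collects the hints; it does not validate anything (the standard library has no XSD validator). schema_hints() is a hypothetical helper of mine, and the sketch assumes what the XML Schema spec says about schemaLocation: its value is a whitespace-separated list of namespace and location pairs.

```python
import xml.etree.ElementTree as ET

# The reserved namespace that the "xsi" prefix is conventionally bound to.
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hints(xml_text):
    """Map each namespace (None for the null namespace) to its schema location."""
    root = ET.fromstring(xml_text)
    hints = {}

    no_ns = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_ns is not None:
        hints[None] = no_ns

    # schemaLocation holds namespace/location pairs separated by whitespace.
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    for namespace, location in zip(tokens[0::2], tokens[1::2]):
        hints[namespace] = location

    return hints


with_target_ns = (
    "<document xmlns='http://www.mydomain.com' "
    "xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' "
    "xsi:schemaLocation='http://www.mydomain.com http://www.domain.com/schema.xsd'/>"
)
without_target_ns = (
    "<document xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' "
    "xsi:noNamespaceSchemaLocation='http://www.domain.com/schema.xsd'/>"
)

print(schema_hints(with_target_ns))
print(schema_hints(without_target_ns))
```

Note that the lookup uses the full namespace rather than the "xsi" prefix, for the same reason as everywhere else in this post: an instance is free to bind that namespace to any prefix it likes.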

About this Blog

This blog is produced by the consultants and analysts from Really Strategies, a content solutions and services provider.
