Working with XML in InCopy, InDesign, and RSuite

We've recently announced the availability of the next evolution of the RSuite CS3 Connector, which integrates RSuite with Adobe InCopy and InDesign CS3.

The Connector enables some very exciting features for publishers looking for full XML workflows with these Adobe tools:

  • Publishers can manage their content in the XML flavor of their choice while still creating and editing that content within InCopy. This is done through a transformation between the publisher's XML and INCX, the native XML file format for InCopy. We've added tools to make this simple for end users, including the ability to browse for content in RSuite from within InCopy (so users who want to stay within the Adobe application do not need to go to another app) and the ability to open a document in InCopy from the RSuite CMS browser-based interface (for users who are working within RSuite).
  • Publishers can then link from an InDesign document, itself managed in RSuite, to the articles or images managed within RSuite. InDesign users can refresh the links to pick up changes made to the content by other authors and editors.
  • We've also been involved with a few projects in which RSuite is used to dynamically generate the InDesign document.  In this scenario, the RSuite user assembles the content they want into RSuite's "content assembly" construct and then pushes it through a workflow that dynamically generates the InDesign file with links to the content already in place.  

We've created a short screencast that illustrates the CS3 Connector if you want to take a peek.

And what about CS4, I hear you ask? So far, most, but not all, of the publishers we have talked with who are interested in this type of integration are using CS3. We don't expect the plugins themselves to need many changes to work with CS4. The biggest change is in the underlying Adobe content models, which differ significantly between CS3 and CS4. Fortunately, they have changed for the better and are easier to work with. So if you have any interest in seeing this type of integration with CS4, let us know.

Editing XML with Quark and MS Word

Last week I attended the Philadelphia XML User Group where Quark was showcasing some of its new announcements around its Quark Dynamic Publishing Solution.

As we’ve mentioned here before, most of the publishers we work with who want to support XML and multi-channel publishing have moved to using InCopy and InDesign. We used to do QuarkXPress projects years ago, but in the last 2-3 years, any project with a desktop publishing application has been with Adobe’s tools.

But starting with moves last year, Quark is doing what it can to get back into the game.    This is especially good for those organizations who remained with Quark.

At the XML Philly user group meeting, the focus was mostly on the Quark XML Editor, which is the new re-branding of the In.vision tool Quark acquired last year (so new it is not on the web site yet). The application basically uses a Word interface to interact with XML documents (during editorial cycles the document is always saved as an XML file unless the user chooses to create a copy in .doc or .docx). The Quark rep said they had to extensively re-engineer Word to make it work the way they wanted.

The Quark XML Editor is better suited for creating content from the start.  Unlike something like eXtyles, it is not set up to easily take arbitrary Word documents created externally (say from external authors) and push them through the application to generate XML.  So it is something that may be useful in a very controlled environment in which the publisher can dictate the tools to the author base, as well as, um, purchase the plugin for every author. 

One of the attendees pointed out how the people in the room (mostly people with a history and interest in XML and publishing) have been waiting for the “holy grail” in editing solutions and this tool is not yet it. But, as we sadly know, that tool does not yet exist.

We’ve been helping some publishers work with the native XML file format of Word, writing transformations to and from another DTD/Schema (the one the customer needs for XML processing, not the OOXML schema). This lets authors work in Word mostly "as is" (this is Word 2007, of course, with its native XML file format) while the publisher still gets the XML they want out of it. For consistently structured content this can work well. Obviously it gets more complicated the more complicated the content gets, but so does everything. Still, I have to think the future will be more about working with Word “as is” with its native XML file format, and transforming that to the XML needed by the publisher, than about working with tools that mimic or re-engineer the Word interface.
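To make the style-to-element idea concrete, here is a minimal Python sketch. It assumes a hypothetical style map (Heading1, Deck, BodyText) and hypothetical target element names (title, deck, para); a real project would map to the customer's own DTD/Schema and would more likely be written in XSLT. Only the WordprocessingML element names and namespace come from the Word 2007 format itself.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in a Word 2007 document.xml part.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical mapping from Word paragraph style IDs to publisher elements.
STYLE_TO_ELEMENT = {"Heading1": "title", "Deck": "deck", "BodyText": "para"}

# Hand-written sample input for illustration only.
SAMPLE_DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Quark and Word</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Deck"/></w:pPr><w:r><w:t>Notes from the user group</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="BodyText"/></w:pPr><w:r><w:t>Last week </w:t></w:r><w:r><w:t>I attended.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Unstyled paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
"""

def word_to_publisher_xml(document_xml):
    """Convert styled Word paragraphs into a simple <article> tree."""
    root = ET.fromstring(document_xml)
    article = ET.Element("article")
    for p in root.iterfind(".//w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Unknown or missing styles are kept but flagged for review.
        name = STYLE_TO_ELEMENT.get(style_id, "unmapped")
        element = ET.SubElement(article, name)
        # A paragraph's text is spread across runs; join them back together.
        element.text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    return article

article = word_to_publisher_xml(SAMPLE_DOCUMENT_XML)
print(ET.tostring(article, encoding="unicode"))
```

The "unmapped" element is where inconsistent styling shows up, which is why consistently structured content converts so much more cleanly.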

As a last thought, it is simply a fact that not everyone thinks about a document’s structure, and some people struggle with that even with lots of training. This means they struggle to consistently apply styles in the same way they might struggle to add XML markup directly. Simplifying a schema/DTD is probably one of the most useful steps towards enabling successful use of an automated Word to XML approach, regardless of the specifics of that approach.

Quick thought on XBRL

I was just reading Mark Logic CEO Dave Kellogg's post on XBRL and Microsoft's claim to be the first company to use it.  After taking a look at the filing from Dave's blog link, it struck me that:

- The poor pilot companies that are outputting XBRL - presumably by hand!  This is clearly no language to be editing or even converting without software/system support.

- Wouldn't it be great to have a publishing oriented XML CMS like RSuite to help author, edit, and pull required information together! 


Achieving automation: InDesign/InCopy to XML

InDesign and InCopy are built for desktop publishing - giving great power to design and editorial.  This is all great news.  However, it makes exporting XML rather tricky - particularly the development of fully automated XML exports.  Sure you can capture XML coming out of these applications, but can you really push that XML into your CMS without having text processing look at it? 

We've looked at this over many projects and the key issue is, of course, the discipline required by each group in the process.  If they don't follow the rules, then their content might not match what your CMS is looking for.  A deck must be labeled as a deck somehow.  Likewise, a B-Head or run-in head must be labeled appropriately. There are also customer or genre specific structures and metadata that must be maintained - with paragraph or character styles (or one of several other techniques).

The point is that you can't look over everyone's shoulder.  Styling and other structure related errors are bound to creep into your content on occasion.   If you only want to accept well structured XML, then you need the capability to automatically identify errors and only ingest acceptable documents.

While you can create scripts to QC the content during production, this poses a scripting update problem every time you want to change your format structure (every time you do a redesign, perhaps).  And while scripting is extremely powerful in CS2 & CS3, it is pretty low level stuff and time consuming to produce anything complicated.  It is also problematic if you don't have a specialist on staff.  Better to write scripts once and move QC somewhere else.

So what to do? One solution is a Schema (or DTD) validation technique that allows this QC operation to proceed during an automated export. The Schema will be more restrictive than just looking at Adobe structures - it will overlay structures specific to your content. And while updating a schema requires some technical know-how, it is more straightforward and much faster than updating scripting of any kind. The reason, of course, is that this is what Schemas are meant to do well.

Using a Schema to validate InDesign/InCopy content can detect a surprising number of human errors with styling and other structuring techniques.  Not all errors, but it can do a solid job if your content is moderately complex.  Content flows into an interim format and is validated before being transformed into its final form in your CMS.  This means that valid content can be fully automated from InDesign to the CMS.  Invalid documents can be automatically siphoned off for review and correction by production.  Users can then be retrained if necessary.
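The validate-then-route flow can be sketched in a few lines of Python. This is only a stand-in under stated assumptions: the interim format (a story of para elements carrying a style attribute), the style names, and the two rules are all hypothetical, and the rules are hand-coded here because the Python standard library has no Schema validator. In practice the rules would live in the Schema or DTD and a validating parser would do the checking.

```python
import xml.etree.ElementTree as ET

# Hypothetical content rules, standing in for what a Schema or DTD would declare.
ALLOWED_STYLES = {"Headline", "Deck", "B-Head", "Body"}

def find_errors(interim_xml):
    """Return a list of structural problems found in one exported story."""
    try:
        story = ET.fromstring(interim_xml)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    styles = [p.get("style") for p in story.iter("para")]
    errors = []
    if not styles or styles[0] != "Headline":
        errors.append("story must start with a Headline")
    for position, style in enumerate(styles, start=1):
        if style not in ALLOWED_STYLES:
            errors.append(f"paragraph {position}: unknown style {style!r}")
        elif style == "Deck" and (position == 1 or styles[position - 2] != "Headline"):
            errors.append(f"paragraph {position}: Deck must follow the Headline")
    return errors

def route(documents):
    """Split exports into those safe to ingest and those needing review."""
    accepted, rejected = [], {}
    for name, xml_text in documents.items():
        errors = find_errors(xml_text)
        if errors:
            rejected[name] = errors
        else:
            accepted.append(name)
    return accepted, rejected

# Two made-up exports: one clean, one with a misspelled paragraph style.
exports = {
    "good.xml": '<story><para style="Headline">H</para><para style="Deck">D</para>'
                '<para style="Body">Text</para></story>',
    "bad.xml": '<story><para style="Headline">H</para><para style="Body">Text</para>'
               '<para style="Dek">Misspelled style</para></story>',
}
accepted, rejected = route(exports)
print("ingest:", accepted)
print("review:", rejected)
```

The point of the design is that a redesign changes only the rules, not the export scripts or the routing.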

Beats checking every exported document ad nauseam, doesn't it? Especially at 2am.

Word 2007 add-in for NLM DTD

Something to keep an eye on for STM publishers . . .

Microsoft released an add-in for Word 2007, called the Article Authoring add-in, that assists in creating XML in the NLM DTD. You can also import an NLM DTD XML document and load it into Word. It seems to be targeted to the authoring point of view, that is, the add-in is designed for the authoring stage, as opposed to the production/editing stage, where more common (and robust) Word-based tools, like Inera eXtyles, live. I assume it could also fit into the editing stage, which seems a much more likely place for this thing to happen.

But before you jump in too quickly, realize that even Microsoft warns that the add-in is a beta and not production ready.  And Inera has some cautionary words about Word 2007.

Other links:

  • Download the plug-in  
  • A blog from Pablo Fernicola, the Microsoft product manager, overseeing the project.
  • A video demo of the add-in (from the blog)
  • I'll be bookmarking anything I see on my del.icio.us links

Live DITA Application: FASB U.S. GAAP Codification

The work of all accountants doing commercial accounting in the U.S. is governed by the Generally Accepted Accounting Principles (GAAP), created and maintained by the Financial Accounting Standards Board, a member-supported organization mandated by the U.S. Congress.

Historically the GAAP has been created as a mishmash of different documents and supporting interpretation and commentary. There was no single organizing schema or source. In short, it was essentially impossible to determine whether or not you had found everything relevant to a given accounting issue.

To address this problem, the FASB decided to create a new all-encompassing classification taxonomy for the GAAP and codify all existing GAAP standards under this taxonomy. This project has been going on for over four years and has resulted in the Accounting Standards Codification, or ASC. The ASC content is currently undergoing an extended period of public review and is available through the FASB ASC Web site: http://asc.fasb.org/home.

While the ASC taxonomy itself was a major achievement, the codification activity was a daunting editorial process in which all the existing standards content had to be re-authored in a new form that directly reflects the taxonomy. To support this activity the FASB decided to use an XML-based system, which should come as no surprise.

But beyond that, the FASB realized several important things:

  • The GAAP content is highly modular
  • The GAAP content can be organized in many different useful ways depending on how it is being used:
    • By subject
    • By industry
    • By business process
    • By what's of immediate interest to a particular person researching a problem or set of problems.
  • The GAAP content requires rich metadata to enable accurate search and retrieval as well as binding to the new ASC taxonomy
  • Licensees of the content will want the XML source and will want to be able to use it with as little effort and expense as possible
  • The FASB does not have huge budgets for XML application development and implementation yet needs non-trivial systems for authoring and managing the GAAP content through its editorial processes as well as for delivery through the authoritative FASB Web site.

Given the foregoing, the FASB realized that a more traditional XML application, while possible, would not necessarily be optimal and would likely be prohibitively expensive and would not meet the requirements of licensees for ease-of-use of the XML content.

However, a DITA-based application would satisfy all these requirements. David Prather at FASB realized that the GAAP content could be modeled quite handily using DITA with some GAAP-specific specializations.

David worked out a clever way to use DITA maps to manage the organization and packaging of the codified GAAP content and hired me to design and implement the necessary GAAP-specific specializations (as well as do the data conversion from an initial XML format they had used for the initial codification editorial work). The FASB selected Ovitas to implement a new editorial support CMS system as well as the dynamic delivery system used to serve the ASC content through the FASB Web site.
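The general idea of using maps to organize the same topics in more than one way can be sketched as follows. This is a minimal illustration, not the FASB's actual design: the topic file names and the subject and industry values are invented, and only the map, title, topichead, and topicref elements come from DITA itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic files and metadata; not the FASB's actual names.
TOPICS = [
    {"href": "revenue-recognition.dita", "subject": "Revenue", "industry": "Software"},
    {"href": "lease-accounting.dita", "subject": "Leases", "industry": "Airlines"},
    {"href": "software-revenue.dita", "subject": "Revenue", "industry": "Software"},
]

def build_map(title, group_by):
    """Build a DITA map that groups the same topics under one facet."""
    dita_map = ET.Element("map")
    ET.SubElement(dita_map, "title").text = title
    groups = {}
    for topic in TOPICS:
        groups.setdefault(topic[group_by], []).append(topic["href"])
    for heading in sorted(groups):
        head = ET.SubElement(dita_map, "topichead", navtitle=heading)
        for href in groups[heading]:
            # The topic files are referenced, never copied, so each map is just a view.
            ET.SubElement(head, "topicref", href=href)
    return dita_map

by_subject = build_map("Codification by subject", "subject")
by_industry = build_map("Codification by industry", "industry")
print(ET.tostring(by_subject, encoding="unicode"))
print(ET.tostring(by_industry, encoding="unicode"))
```

Because the topics themselves never change, adding a new organization (by business process, say) means adding one more map rather than touching the content.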

The project went remarkably quickly--we had working DITA specializations defined and in place in a matter of weeks and the models required only minor refinement as the system implementation progressed, mostly stemming from new understandings of the underlying content as the codification editorial process approached completion. The CMS and Web site implementation went equally smoothly (remarkably so in my experience building such systems).

Because we could use the free DITA Open Toolkit to generate HTML sufficient for internal review of the codified content we didn't need to invest any time or money in acquiring or building rendering support just to support internal Q/A of the DITA content, a significant savings. Essentially, it allowed one part-time consultant, me, to do what would in the past have required a team of three or four consultants months of work to implement. By the same token, we were able to use the off-the-shelf DITA support in XML editors like Arbortext Editor and OxygenXML, removing the need to invest in document-type specific editor configurations and customizations, again saving weeks or months of consultant time. I think I spent about two days coming up to speed on how to configure Arbortext Editor to work with specialized DITA document types and about 1/2 day creating the necessary configurations (it's essentially a copy and modify process that I can now do in minutes).

Likewise, the Toolkit means that licensees can do *something* with the ASC content immediately, as well as giving them a solid base from which to develop whatever internal processes they need. Large publishers with existing XML infrastructure can of course apply that, but smaller publishers with little or no XML infrastructure can still take immediate advantage of the ASC XML source.

The ASC content on the FASB Web site is served dynamically from a slightly sanitized version of the DITA source; it is not static HTML pages generated from the DITA source.

The FASB ASC application is a working example of how the unique features of DITA XML applications significantly lower the cost of building this type of system while enabling significant value for the DITA-based content itself.

One interesting side effect of this system is that most, if not all, of the FASB's licensees, which include all the big name publishers and many smaller ones, will end up with both DITA-supporting internal systems as well as internal DITA expertise that can then be quickly and easily applied to any other DITA-based content, regardless of its markup details or subject domain. That seems pretty interesting to me....

XML2007 Day 1 (publishing track)

Here are my notes from XML2007 in Boston, Day 1. Since I'm responsible for the publishing track scheduling, I'm hanging out in here all conference.

******

Opening Plenary - Does XML Have a Future on the Web? The big takeaway from this panel was that developers in the real world are still being confronted with the fact that a lot of data can be modeled very effectively relationally and as objects, and using XML for such data imposes some unwelcome complexity, especially in terms of how to map the data to structures available in programming languages. JSON provides an easier way for such developers to work with content. On the other hand, some content doesn't fit well into this model, and so XML's complexity provides value well worth the cost. It was fun to see Michael Sperberg-McQueen and Douglas Crockford "discuss" this divide, but most audience members found value in both perspectives, and didn't seem to take seriously Crockford's notion that XML has been outright dangerous, in part because it has been a distraction from the evolution of other web standards like HTML and JavaScript. The third panelist, Michael Day, offered some practical perspective from the viewpoint of a software provider that needs to wrestle with all the ways in which information might be published on the web. As he said, he saw no reason to privilege one format (HTML, XML, JSON) over another. He also commented that CSS could potentially be used for sophisticated print formatting, not only for the web. I'd like to hear more about that.
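A small Python sketch of the divide the panel was describing, using made-up sample data: a record maps directly onto a dictionary, while a sentence with inline markup is mixed content, where text and elements interleave and their order carries meaning.

```python
import json
import xml.etree.ElementTree as ET

# Record-like data: maps cleanly to objects, so JSON is a natural fit.
record = json.loads('{"title": "As You Like It", "acts": 5}')

# Document-like data: text and markup interleave (mixed content), and order matters.
sentence = ET.fromstring(
    "<p>Enter <name>Rosalind</name> and <name>Celia</name>, talking.</p>"
)

def pieces(element):
    """Yield (kind, text) pairs in document order for one level of mixed content."""
    if element.text:
        yield ("text", element.text)
    for child in element:
        yield (child.tag, "".join(child.itertext()))
        if child.tail:
            yield ("text", child.tail)

print(record["title"], record["acts"])
print(list(pieces(sentence)))
```

The sentence can of course be forced into name/value structures, but only by inventing an ordered list of typed fragments, which is roughly what the XML already is.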

******

Eric Severson from Flatirons spoke about practical DITA lessons. (1) It's harder to model and actually develop content for re-use than you might think. Shouldn't be an all-or-nothing approach - re-use where there is a lot of benefit, don't force-fit the re-use in other cases. Reconciliation also doesn't all need to happen at the time of DITA adoption. (2) How to deal with the approval process when no longer working with publications - working with topics? Create a map that is specifically designed to facilitate review - includes enough context for review but probably not the same as the publication. (3) Use specialization only when absolutely necessary - tools don't yet have much support for specialization. DITA committee is trying to address some inflexibility in generic task model that means people end up specializing when they might not want to. If need to specialize, specialize from the standard types as much as possible (task is exception because of other issue). Domain specialization is also an area where specialization is very often justified (for keyword typing, for example). (4) Use conref (enables re-use of content objects inside a topic) only in cases where want to create an index or list of things - for true re-use, can become very limiting to authors - hard to write text for so many contexts. Avoid nested topics for similar reasons. Maps and nested maps provide a better way to do this - impose less overhead on the topics themselves. (5) Dynamic content delivery - Important benefit of DITA is the metadata on topics/maps that indicate audience and other information. Rather than build a static publication on a topic, allow users to leverage that metadata in search.

******

Matt Turner from Mark Logic talked about Office Open XML (the XML underneath Office 2007). Quote: Office Open XML is cool because it's XML and you can mess with it. The new Office writes XML natively - no other format living in between the applications and XML. This is not the previous XML formats - this is new. Spec is huge and complicated and hard to read because of need for backwards compatibility and need for performance (one letter elements). Spec defines a zip package of a document's data (XML) and other items (images etc). The individual XML items in the zip are easy to interpret. The Office apps are now OOXML editors - and other (non-XML) applications could be OOXML editors also. Can bind a control in document (like for a form) to an XML instance inside the document or dynamically retrieved. Have successfully generated OOXML from other data sources (showed demo with Shakespeare's plays). Demonstrated structural editing inside Word. The Office ribbon (replacement for toolbars/menus) can be configured (with XML) and customized to provide the kind of editing tools that are desired, including interaction with other server applications. I can't possibly describe Matt's mashup of tech docs and As You Like It, but it was both informative and amusing. Thanks, Matt.
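To show what "a zip package of XML you can mess with" looks like in practice, here is a minimal Python sketch (not from Matt's talk). It builds a tiny stand-in package in memory and reads it back. The stand-in is an assumption for illustration: it has only a main document part and a placeholder image, whereas a real .docx also carries content-type and relationship parts, so Word would not open it. The part name word/document.xml and the one-letter element names are the real ones.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace; note the terse element names (p, r, t).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def make_sample_package():
    """Build a tiny stand-in for a .docx in memory (not a complete, openable file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{W}"><w:body>'
            "<w:p><w:r><w:t>All the world's a stage</w:t></w:r></w:p>"
            "<w:p><w:r><w:t>And all the men and women merely players</w:t></w:r></w:p>"
            "</w:body></w:document>",
        )
        package.writestr("word/media/image1.png", b"placeholder bytes")
    buffer.seek(0)
    return buffer

def read_paragraphs(package_file):
    """List the package parts and pull paragraph text out of the main document part."""
    with zipfile.ZipFile(package_file) as package:
        parts = package.namelist()
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = [
        "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for p in root.iter(f"{{{W}}}p")
    ]
    return parts, paragraphs

parts, paragraphs = read_paragraphs(make_sample_package())
print(parts)
print(paragraphs)
```

The same read_paragraphs function should work on a real .docx file path, since it only relies on the main document part being present.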

******

XML Authoring Tools panel with Justsystems (XMetaL), Adobe, Xopus, moderated by our own Mark Jacobson. XMetaL: Sweet spot is for direct creation of XML technical documentation. Vision of company is to be able to use other Justsystems product (xfy) to enable environments in which content from multiple schemas can be mashed together and where documents can include application logic. Adobe: Concept is to enable tools that address cross-media workflows whether that's simple XML docs using the XML tagging features in Creative Suite or whether it's focusing on layout and automating the generation of the XML later. Also want to get to re-purposing and to support for authoring based on rules or scenarios. For future thinking about schema language including RELAX NG. Xopus: Focused purely on XML editing by non-technical users in a web browser. Discussion: Mark pointed out two contexts for discussion - ubiquity of Word and expectations that brings, and the fact that people don't like to edit inside a structure.

******

Bob DuCharme - XHTML 2 for Publishers: New opportunities for storing interoperable content and metadata. 1.0 was about separation of design and content. 1.1 was about modularization rather than features. 2.0 goals: encode more semantics, more device independence, better forms (XForms), less scripting (XML Events). (Note, not a W3C recommendation yet.) The first two of these drive Bob's contention that XHTML 2 can now be used for publisher content - probably not as the primary source, but for more than just as browser format. Obvious example: interchange among organizations. Why consider it: DTDs can be really complicated - overwhelming and intimidating. HTML is familiar and simpler. How does this work? STRUCTURE: XHTML 2 has a <section> element for grouping. The <h> element represents a heading regardless of level. So, have structure, can change levels without re-tagging. BETTER SEMANTICS/STRUCTURE: separator rather than hr. pre can be embedded inside p, which means can do things like present a single paragraph across multiple lines. lists can be embedded inside paragraphs so relationships are clearer. p as img - show image if available (or depending on device), otherwise show the paragraph text. Use of role attribute that points to namespace rather than class attribute (similar to DocBook). (Use of class for semantics can interfere with use of class for stylesheets, plus class is supposed to be nmtoken.) METADATA: Can use RDFa to embed your own metadata. The predicate and value go on elements that represent the subject (or that are contained in the subject object). So, maybe this:

<section><span property="dc:subject" content="recipe"/>...</section>

Can also do this:

<meta about="http://mynamespace" property="dc:subject" content="recipe"/>

Or even put an id on a content object (like a section) and point to it from meta tags.
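A rough Python sketch of how a consumer might pull those statements back out. It is a simplification under stated assumptions, not a conforming RDFa processor: it treats an element with an id as the subject for everything inside it, lets an explicit about attribute override that, and the sample document (including the id recipe-1 and the creator value) is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample combining the two patterns from the talk notes.
SAMPLE = """
<body>
  <section id="recipe-1">
    <span property="dc:subject" content="recipe"/>
    <h>Pancakes</h>
  </section>
  <meta about="#recipe-1" property="dc:creator" content="A. Cook"/>
</body>
"""

def extract_statements(xml_text):
    """Collect (subject, property, value) triples from property/content attributes."""
    root = ET.fromstring(xml_text)
    statements = []

    def walk(element, subject):
        # Simplification: an id or an explicit about attribute sets the current subject.
        if element.get("id"):
            subject = "#" + element.get("id")
        subject = element.get("about", subject)
        if element.get("property") and element.get("content") is not None:
            statements.append((subject, element.get("property"), element.get("content")))
        for child in element:
            walk(child, subject)

    walk(root, "")
    return statements

statements = extract_statements(SAMPLE)
for statement in statements:
    print(statement)
```

Both statements end up attached to the same section, whether the metadata sits inside the content object or points at it from a meta element.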

But, as one attendee pointed out, HTML 5 is on a separate path from XHTML 2, and it isn't at all clear that XHTML 2 will get much support from browser vendors. Regardless, it appears to be a simple, familiar, but reasonably powerful way of sharing documents even if there is never any expectation of viewing them directly in web browsers.

******

Eric Clark of Time and Lee Vetten of McGraw Hill reviewed what's new in PRISM 2.0:

  • Addition of elements to reflect more complicated workflow (sometimes web-first, sometimes print first) – original platform, web channel, killdate, postdate.
  • Support schema as well as DTD
  • Profiles: XML only profile, rdf/XML profile, also XMP profile now (especially for PDF archives)
  • Updated controlled vocabularies
  • Added aggregation type, genre, and presentation rather than just the previous “category”
  • Added roles for creator and contributor
  • Added a bunch of other elements, including some inline elements
  • Eliminated PRISM 1.0 elements that were redundant to Dublin Core elements

Note: 2.0 is not backwards compatible.

Future work: Subcommittee around rights management (tracking and handling of digital assets). Creating a cookbook document that will help implementors understand how to support some standard use cases. Will roll out via webinar in January.

2.0 docs available now but not 100% complete. Final posting expected in December.

******

Jens Erlandsen spoke about a Swedish Dictionary project for the Swedish Academy (the ones that award the Nobel Prize in Literature). Dictionary is modeled after the OED - massive scale. Been working on the first edition since 1898. Expect to complete in a few more years. 200 million characters, XML = ~600MB. Happy with manual workflow - working with slips of paper that can sort and look at more usefully than if digital. Dictionaries are special - average number of characters per element is about 7 for this dictionary. Dense, highly marked up, high quality content. Lots of element types - hundreds. The editorial rules are complex and unstable. And some rules can't be reflected in XML - homographs, sorting rules, etc. So how to build a schema? What to leave out? Jens' main theses/questions: (1) A schema can't be developed outside the context of how it will be used and what tools will be used. (2) Can one schema support all needs? No - different parts of the process need different schemas. (3) What is needed beyond schemas to capture all rules? Something - what? Jens covered the approach in detail. This was a great illustration of one of the points from the opening session: dictionary content cannot be represented with name/value pairs. Jens also drove home that an authoring schema can't be designed without lots of experimentation to see what it's like to actually use it as an author/editor - e.g. to allow users to author flat content and add structure later, to re-organize entries, and so on. He mentioned that they used the iLEX tools for their project, which seem to be pretty cool.

******

Great day.

The big content system integration II

I've modified the 'big' diagram from the first post on this topic to show a circular content flow - now called editorial flow. 

Please find it here: Download the_system_ii.pdf

The diagram is still more conceptual than technical.  Of course at some point this thinking needs to be specialized for the particular publishing vertical, product needs, and company needs.

A few thoughts:

1. Shows content editing flowing in a circle. Enter at any point and proceed downstream. That is, start developing a print article or publication and complete it, then proceed to develop it into a web article or publication. Or vice versa.

2. Prior to entering a print or web editorial workflow there is a content adding, packaging, editing phase, where it is assumed that a web interface will allow review of content sources and collection into the initial manuscript for the subsequent print or web editorial workflow.  This might, for example, allow enhancement of an article - with a new sidebar, for example, as it proceeds downstream. 

3. Implies content reuse if the circle keeps flowing.  The circle can also stop at any point if needs are met.  Some publishers might stop at having a print and web output (in any order), some might stop with either a print or web output, some might keep the cycle going indefinitely, building a large content repository over time (e.g. educational publishers).  The diagram also implies content maintained as XML rather than being imported and exported from editorial/workflow tools.   

4. It has a central repository built of two fundamental parts - XML and binary content (images, etc.).  Work done in page layout tools/editorial tools/workflow tools is transitory (though might be archived).  The purpose of the repository would be to accurately manage 'content' of published products and to also provide a starting point for initial manuscript creation for the next stage in the cycle.

5. Upon completion of the web or print cycle, a number of XML enabled exports are possible along with the main article/publication produced.   This is a requirement of some publishers, and certainly there for the taking, if content is accurately managed as XML.

Well, readers, what do you think?  Does it match your thinking?  Should we keep going with this?

What I want from Adobe - x-ray file formats

Consider The big content system integration diagram (draft 1).  What are the bottlenecks for the content stream?  Well if what I'm doing is any indicator, importing and exporting from InCopy or InDesign still has a little friction to it.  This is being addressed by Adobe progressively - see CS3 (yay!) for example over CS2.  Also being addressed by Softcare and other companies, again progressively.

But is there a fundamental change that can happen here?  Can we get XML that can pass through Adobe formats frictionlessly like Superman can see through walls with his x-ray vision?  (Digression here: There is a surprising amount of controversy on the Web about Superman's powers - many self proclaimed pundits seem to say that his X-Ray vision is unrealistic!  Not science that it is caused by low gravity on his home planet?  Baloney!)

Maybe the problem isn't that content can't yet be imported and exported seamlessly, maybe it is that content shouldn't be imported and exported at all.  If Adobe considered InDesign/InCopy not to be a holder of data, but an aggregator of data for print layout, then we might start getting somewhere.  Already, images are externally linked, why not, we might then ask, have text objects be externally linked?  I'm not talking about the page geometry, etc. that might live inside of an InCopy document.  I'm talking about just the text - and as XML if you please.

If Adobe therefore allowed XML content to remain external as files, and it allowed all external content, XML or images, to be linkable via HTTP protocol or otherwise, then we might have a situation where media and XML management systems could maintain content continually - without having to messily import during print production and export after print production.

Think of all the advantages!  Tremendous.  Metadata could be added externally, preparation for web could happen simultaneously. And Adobe page apps would still manage page layout and editing within page geometries so no skin off their nose.   Of course, this is not an easy thing to accomplish - but it is the logical future state - each format to its own system, with specialized apps consolidating, editing, and arranging the last mile of content.

NCAA to Blogger - You're Out!

If you haven't seen the story, last week a reporter who was blogging at an NCAA baseball game was removed from the press box during the game.  Go here to see the story on ESPN.com.  It's always those darn bloggers who are causing all of the problems, isn't it? 

Bloggers kind of remind me of my early days (way back) when I used to skateboard in public places. Back in the 1970s, you only had a parking lot or sidewalk to show off your talent (of which I had just enough to not injure myself). Bloggers, to me, tend to be renegades with an agenda, not unlike skateboarders who want the freedom to skate where they want and when they want. Why can't bloggers blog where they want and when they want? Honestly, not sure.

If people can text message from anywhere, why can't bloggers blog from anywhere? I would think that both forms of text generation would violate the NCAA rules.  Imagine tossing 1,000 fans from a baseball game because they passed on the score to the game via text message to a friend.  Is the NCAA going to monitor that?  I think they have better things to do, like worry about the graduation rate of athletes.

Ultimately there will be some rules around blogging from the press box due to rights associated with live broadcast, but the action by the NCAA last week at the baseball game to toss the reporter was just a bit too much.

My 2 cents from a former skateboarder (turned blogger).

About this Blog

This blog is produced by the consultants and analysts from Really Strategies, a content solutions and services provider.
