RSuite CMS provides loud and clear answer for Audible.com

We are often asked by publishers to describe the real business impact RSuite CMS has on our clients. Like Blood-Horse Publications, the subject of my previous post, Audible.com is a client that has leveraged the power of RSuite to realize its business goals.

Audible, Inc., an Amazon.com, Inc. subsidiary, is the leading provider of premium digital spoken audio information and entertainment on the Internet.

In early 2007 Audible.com launched an aggressive project to revamp its entire metadata program to better manage and process the metadata files it receives from its publishing partners. This program had the following business objectives:

  • Ensure error-free metadata by using publisher or publisher aggregators as the source of data, and by developing new tools to drive, search, browse, and publish to store functions off this sourced data.
  • Ensure the ability to identify Audible products on partner sites by providing ISBNs that correspond to the downloadable digital binding with each product in feeds to partners, wherever and whenever possible.
  • Reduce the occurrences of human error by automatically populating data into web site databases, from the sourced data.
  • Improve findability, searchability, and marketability of products by standardizing keywords, categories, authors, contributors, and publishers.
  • Improve royalty systems by making contract entry a requirement for any product being pushed to an Audible site.

During a 4-week proof of concept (POC), RSuite was configured to prove out several use cases:

  1. Leverage RSuite’s workflow tool to ingest ONIX feeds and audio files
  2. Apply additional metadata (both manually and automatically)
  3. Distribute the appropriate content packages to target delivery sites.

During this stage, many business rules relevant to improving Audible's business opportunities were also documented. After a successful POC, Audible.com selected RSuite for its metadata and aggregation solution.

RSuite became the framework upon which Audible crafted solutions to meet all its requirements: workflow, business rules validation, content aggregation and delivery. In 6 months, RSuite was configured and implemented to become Audible's workflow tool, which enables seamless transfer of content from publisher feeds to web site-ready files.

Now, after using RSuite for over a year, Audible has realized its goal of integrating a tool that would satisfy the business objectives and show a return on investment quickly. As Art Zegarek, director of data architecture, told our team, "RSuite has become a very critical system very fast!" It is satisfying to know that RSuite is helping an aggregator such as Audible.com meet its business objectives every day.

"Taxonomies are dead. Long live metadata!"

This is number 3 on the Technology Predictions for 2009 on CMS Watch.

"With social computing coming to the fore, it's never been more obvious that everyone does not, and will never, categorize things in the same way. It doesn't even matter what's correct anymore (well, it does to me, but I'm not about to spend my days stopping people from tagging a map of Botswana with the word "Ohio.") While I'll never agree with David Weinberger's assertion that "everything is miscellaneous" (a taxonomist's least-favorite word), I will assert that the days of the traditional, definitive, and single-hierarchy taxonomy are long behind us.

Enter the varied and multi-faceted application of metadata, experienced as people would like to experience it. In the search world, Endeca popularized it, now it's a commodity. You should be able to get to information the way you want, which may be different from your colleague's approach. We still need controlled vocabularies. We still need to tag content. Text mining and auto-tagging software is gradually improving, and extracted terms can be applied as metadata. But that metadata needs to be a lot more fluid, cloud-like, and by no means fixed in a single hierarchy. And even if it doesn't make sense to you that that map of Botswana is tagged with the word "Ohio" -- it probably makes perfect sense to someone. One person's chaos is another person's perfect path to findability."

Opinions?

Real metadata

I'm adapting a detailed presentation on metadata for a webinar on metadata next week, and thought it might be helpful to post some excerpts. Most CMS products are weak in managing "real" metadata, and impose limitations on a publisher's product development. Until you understand metadata, it's hard to understand why this matters so much.

A typical CMS product stores metadata as name/value pairs in a relational database. Most metadata doesn't want to live like that. If you've invested in XML for your documents, you should also invest in it for your metadata. Here are a few reasons:

1. Most metadata is more naturally modeled as XML than as simple tables.
2. Some metadata lives most meaningfully in a content context (inside the document)
3. Metadata markup should conform to document markup

This illustrates one of the greatest selling points of a CMS that uses XML as its native format - your metadata can be any XML you want it to be. (Probably not a surprise that RSuite is one of these!)

Read on for more detail.

1. Most metadata is more naturally modeled as XML than as simple tables.

Metadata has
- Internal markup
- Hierarchy
- The dreaded mixed content

Simple tables either can't represent these or make it really hard.

For example, journal article contributors often:
- Come in groups
- Are typed (author, editor, …)
- Are ordered
- Have internal structure (first, last, etc)
- Include spacing and punctuation around internal elements
- Are related to/contain affiliation information
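
To make the contributor example concrete, here is a minimal Python sketch. The element and attribute names (contrib-group, contrib, surname, given, aff) are made up for illustration; they are not taken from any particular schema or from RSuite. The point is only that grouping, typing, ordering, internal structure, and the punctuation between name parts all survive in one small piece of XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical contributor markup; element and attribute names are illustrative only.
CONTRIB_XML = """
<contrib-group>
  <contrib type="author"><surname>Smith</surname>, <given>Jane</given><aff ref="a1"/></contrib>
  <contrib type="author"><surname>Lee</surname>, <given>Min</given><aff ref="a2"/></contrib>
  <contrib type="editor"><surname>Garcia</surname>, <given>Ana</given><aff ref="a1"/></contrib>
</contrib-group>
"""


def read_contributors(xml_text):
    """Return contributors in document order with type, name parts, and affiliation ref."""
    group = ET.fromstring(xml_text)
    result = []
    for position, contrib in enumerate(group.findall("contrib"), start=1):
        result.append({
            "position": position,                 # order comes from the document itself
            "type": contrib.get("type"),          # author, editor, ...
            "surname": contrib.findtext("surname"),
            "given": contrib.findtext("given"),
            "aff": contrib.find("aff").get("ref"),
            # itertext() keeps the punctuation that sits between the name parts
            "display": "".join(contrib.itertext()).strip(),
        })
    return result


contributors = read_contributors(CONTRIB_XML)
for c in contributors:
    print(c["position"], c["type"], c["display"], c["aff"])
```

Doing the same with name/value pairs would need extra columns or naming tricks for order, type, and the text between the name parts.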

2. Some metadata lives most meaningfully in a content context

Example: headings
- Are naturally authored while authoring the document itself (news headlines can be an exception)
- Can be for document sections
- Can contain footnote references to document locations
- Are presented as part of the document for editorial review or to readers

3. Metadata markup should conform to document markup

Example: Journal article abstracts contain formatting and other inline elements, paragraphs, and even lists. It makes sense that the tags used for the rest of the article also be used in the abstract. This can be awkward to impossible depending on the approach taken with a relational model.
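
A small Python sketch of what is at stake here, using invented tag names (article, abstract, body, p, italic, sub) rather than any real journal DTD: flattening the abstract into a plain string, as a simple name/value field would, drops the inline markup, while keeping it as XML lets the abstract share the body's tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; tag names are illustrative, not from a real DTD.
ARTICLE = """
<article>
  <abstract><p>Effects of <italic>E. coli</italic> on H<sub>2</sub>O uptake.</p></abstract>
  <body><p>We studied <italic>E. coli</italic> and H<sub>2</sub>O in detail.</p></body>
</article>
"""

article = ET.fromstring(ARTICLE)
abstract = article.find("abstract")

# Flattened to a plain string, the italic and subscript tags are gone.
flat_value = "".join(abstract.itertext())

# Kept as XML, the abstract retains its inline markup.
xml_value = ET.tostring(abstract, encoding="unicode")

# Tags used in the abstract that the body does not use; empty in this sample,
# so a single set of tags serves both.
abstract_tags = {el.tag for el in abstract.iter()} - {"abstract"}
body_tags = {el.tag for el in article.find("body").iter()} - {"body"}
extra_tags = abstract_tags - body_tags

print(flat_value)
print(xml_value)
print(sorted(extra_tags))
```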

These are three modeling reasons to use an XML-aware CMS for metadata. There are other kinds of reasons also. For example, you might not know that a field qualifies as metadata until well after your CMS is deployed - when you are trying to create a new product. Finding this out "late" will cause you pain (time and money) in most environments. You'll be in a much better position if your CMS allows you to treat any content in your XML document as metadata. This same example - not knowing about usage until after CMS deployment - is also a good example of why even a custom relational CMS (more than name/value pairs) still limits publishers too much.

XML2007 Day 1 (publishing track)

Here are my notes from XML2007 in Boston, Day 1. Since I'm responsible for the publishing track scheduling, I'm hanging out in here all conference.

******

Opening Plenary - Does XML Have a Future on the Web? The big takeaway from this panel was that developers in the real world are still being confronted with the fact that a lot of data can be modeled very effectively relationally and as objects, and using XML for such data imposes some unwelcome complexity, especially in terms of how to map the data to structures available in programming languages. JSON provides an easier way for such developers to work with content. On the other hand, some content doesn't fit well into this model, and so XML's complexity provides value well worth the cost. It was fun to see Michael Sperberg-McQueen and Douglas Crockford "discuss" this divide, but most audience members found value in both perspectives, and didn't seem to take seriously Crockford's notion that XML has been outright dangerous, in part because it has been a distraction from the evolution of other web standards like HTML and JavaScript. The third panelist, Michael Day, offered some practical perspective from the viewpoint of a software provider that needs to wrestle with all the ways in which information might be published on the web. As he said, he saw no reason to privilege one format (HTML, XML, JSON) over another. He also commented that CSS could potentially be used for sophisticated print formatting, not only for the web. I'd like to hear more about that.

******

Eric Severson from Flatirons spoke about practical DITA lessons. (1) It's harder to model and actually develop content for re-use than you might think. Shouldn't be an all-or-nothing approach - re-use where there is a lot of benefit, don't force-fit the re-use in other cases. Reconciliation also doesn't all need to happen at the time of DITA adoption. (2) How to deal with the approval process when no longer working with publications, but with topics? Create a map that is specifically designed to facilitate review - includes enough context for review but probably not the same as the publication. (3) Use specialization only when absolutely necessary - tools don't yet have much support for specialization. DITA committee is trying to address some inflexibility in the generic task model that means people end up specializing when they might not want to. If you need to specialize, specialize from the standard types as much as possible (task is the exception because of the other issue). Domain specialization is also an area where specialization is very often justified (for keyword typing, for example). (4) Use conref (enables re-use of content objects inside a topic) only in cases where you want to create an index or list of things - for true re-use, it can become very limiting to authors - hard to write text for so many contexts. Avoid nested topics for similar reasons. Maps and nested maps provide a better way to do this - impose less overhead on the topics themselves. (5) Dynamic content delivery - An important benefit of DITA is the metadata on topics/maps that indicates audience and other information. Rather than build a static publication on a topic, allow users to leverage that metadata in search.

******

Matt Turner from Mark Logic talked about Office Open XML (the XML underneath Office 2007). Quote: Office Open XML is cool because it's XML and you can mess with it. The new Office writes XML natively - no other format living in between the applications and XML. This is not the previous XML formats - this is new. Spec is huge and complicated and hard to read because of need for backwards compatibility and need for performance (one letter elements). Spec defines a zip package of a document's data (XML) and other items (images etc). The individual XML items in the zip are easy to interpret. The Office apps are now OOXML editors - and other (non-XML) applications could be OOXML editors also. Can bind a control in document (like for a form) to an XML instance inside the document or dynamically retrieved. Have successfully generated OOXML from other data sources (showed demo with Shakespeare's plays). Demonstrated structural editing inside Word. The Office ribbon (replacement for toolbars/menus) can be configured (with XML) and customized to provide the kind of editing tools that are desired, including interaction with other server applications. I can't possibly describe Matt's mashup of tech docs and As You Like It, but it was both informative and amusing. Thanks, Matt.
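
As a rough illustration of the "zip package of XML parts" idea (not a reconstruction of Matt's demo), here is a Python sketch that builds a toy package in memory and reads the text back out of the main part. The package is deliberately incomplete: it omits the content-types and relationship parts a real file needs, so Word would not open it, and the image part is placeholder bytes. The part names and the sample text are my own choices for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Note the one-letter element names: p (paragraph), r (run), t (text).
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>All the world's a stage</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>And all the men and women merely players</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_toy_package():
    """Zip one XML part and one placeholder image part into an in-memory package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", DOCUMENT_XML)
        package.writestr("word/media/image1.png", b"placeholder bytes, not a real PNG")
    buffer.seek(0)
    return buffer


def paragraph_texts(package_file):
    """Open the package, parse the main part, and return the text of each paragraph."""
    with zipfile.ZipFile(package_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        texts.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return texts


package_file = build_toy_package()
with zipfile.ZipFile(package_file) as package:
    part_names = package.namelist()
texts = paragraph_texts(package_file)

print(part_names)
print(texts)
```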

******

XML Authoring Tools panel with Justsystems (XMetaL), Adobe, Xopus, moderated by our own Mark Jacobson. XMetaL: Sweet spot is for direct creation of XML technical documentation. Vision of company is to be able to use other Justsystems product (xfy) to enable environments in which content from multiple schemas can be mashed together and where documents can include application logic. Adobe: Concept is to enable tools that address cross-media workflows whether that's simple XML docs using the XML tagging features in Creative Suite or whether it's focusing on layout and automating the generation of the XML later. Also want to get to re-purposing and to support for authoring based on rules or scenarios. For future thinking about schema language including RELAX NG. Xopus: Focused purely on XML editing by non-technical users in a web browser. Discussion: Mark pointed out two contexts for discussion - ubiquity of Word and expectations that brings, and the fact that people don't like to edit inside a structure.

******

Bob DuCharme - XHTML 2 for Publishers: New opportunities for storing interoperable content and metadata. 1.0 was about separation of design and content. 1.1 was about modularization rather than features. 2.0 goals: encode more semantics, more device independence, better forms (XForms), less scripting (XML Events). (Note, not a W3C recommendation yet.) The first two of these drive Bob's contention that XHTML 2 can now be used for publisher content - probably not as the primary source, but for more than just a browser format. Obvious example: interchange among organizations. Why consider it: DTDs can be really complicated - overwhelming and intimidating. HTML is familiar and simpler. How does this work? STRUCTURE: XHTML 2 has a <section> element for grouping. The <h> element represents a heading regardless of level. So, have structure, can change levels without re-tagging. BETTER SEMANTICS/STRUCTURE: separator rather than hr. pre can be embedded inside p, which means can do things like present a single paragraph across multiple lines. lists can be embedded inside paragraphs so relationships are clearer. p as img - show image if available (or depending on device), otherwise show the paragraph text. Use of role attribute that points to namespace rather than class attribute (similar to DocBook). (Use of class for semantics can interfere with use of class for stylesheets, plus class is supposed to be nmtoken.) METADATA: Can use RDFa to embed your own metadata. The predicate and value go on elements that represent the subject (or that are contained in the subject object). So, maybe this:

<section><span property="dc:subject" content="recipe"/>...</section>

Can also do this:

<meta about="http://mynamespace" property="dc:subject" content="recipe"/>

Or even put an id on a content object (like a section) and point to it from meta tags.
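
To show how a consumer might pull such statements back out, here is a small Python sketch. It is not an implementation of the RDFa processing rules: it assumes a much simpler subject rule of my own (use the about attribute if present, otherwise the nearest ancestor that has an id), and the wrapper div, the id, and the paragraph text are added only for the example.

```python
import xml.etree.ElementTree as ET

# Fragment based on the two snippets above; the <div>, the id, and the <p> are added for illustration.
FRAGMENT = """
<div>
  <section id="s1"><span property="dc:subject" content="recipe"/><p>Whisk the eggs.</p></section>
  <meta about="http://mynamespace" property="dc:subject" content="recipe"/>
</div>
"""


def collect_statements(xml_text):
    """Return (subject, property, content) triples using a simplified subject rule."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    statements = []
    for element in root.iter():
        if element.get("property") is None:
            continue
        subject = element.get("about")
        node = parent_of.get(element)
        # Simplification: without an about attribute, the nearest ancestor with an id is the subject.
        while subject is None and node is not None:
            if node.get("id"):
                subject = "#" + node.get("id")
            node = parent_of.get(node)
        statements.append((subject, element.get("property"), element.get("content")))
    return statements


statements = collect_statements(FRAGMENT)
for statement in statements:
    print(statement)
```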

But, as one attendee pointed out, HTML 5 is on a separate path from XHTML 2, and it isn't at all clear that XHTML 2 will get much support from browser vendors. Regardless, it appears to be a simple, familiar, but reasonably powerful way of sharing documents even if there is never any expectation of viewing them directly in web browsers.

******

Eric Clark of Time and Lee Vetten of McGraw Hill reviewed what's new in PRISM 2.0:

  • Addition of elements to reflect more complicated workflow (sometimes web-first, sometimes print first) – original platform, web channel, killdate, postdate.
  • Support schema as well as DTD
  • Profiles: XML only profile, rdf/XML profile, also XMP profile now (especially for PDF archives)
  • Updated controlled vocabularies
  • Added aggregation type, genre, and presentation rather than just the previous “category”
  • Added roles for creator and contributor
  • Added a bunch of other elements, including some inline elements
  • Eliminated PRISM 1.0 elements that were redundant to Dublin Core elements

Note: 2.0 is not backwards compatible.

Future work: Subcommittee around rights management (tracking and handling of digital assets). Creating a cookbook document that will help implementors understand how to support some standard use cases. Will roll out via webinar in January.

2.0 docs available now but not 100% complete. Final posting expected in December.

******

Jens Erlandsen spoke about a Swedish Dictionary project for the Swedish Academy (the ones that give the Nobel prize). Dictionary is modeled after the OED - massive scale. Been working on the first edition since 1898. Expect to complete in a few more years. 200 million characters, XML= ~600MB. Happy with manual workflow - working with slips of paper that can sort and look at more usefully than if digital. Dictionaries are special - average number of characters per element is about 7 for this dictionary. Dense, highly marked up, high quality content. Lots of element types - hundreds. The editorial rules are complex and unstable. And some rules can't be reflected in XML - homographs, sorting rules, etc. So how to build a schema? What to leave out? Jens' main theses/questions: (1) A schema can't be developed outside the context of how it will be used and what tools will be used. (2) Can one schema support all needs? No - different parts of the process need different schemas. (3) What is needed beyond schemas to capture all rules? Something - what? Jens covered the approach in detail. This was a great illustration of one of the points from the opening session: dictionary content cannot be represented with name/value pairs. Jens also drove home that an authoring schema can't be designed without lots of experimentation to see what it's like to actually use it as an author/editor - e.g. to allow users to author flat content and add structure later, to re-organize entries, and so on. He mentioned that they used the iLEX tools for their project, which seem to be pretty cool.

******

Great day.

PRISM 2.0 open for public comments

The Publishing Requirements for Industry Standard Metadata (PRISM) working group has released the PRISM 2.0 specification for public comment.

From the press release,

This major revision of PRISM addresses the new requirements for publishers and media companies to deliver content in an online multimedia environment, as well as in print.  According to Lee Vetten, McGraw-Hill Business Information Group’s Co-Chair of the PRISM Working Group, “PRISM 2.0 heralds a new generation for PRISM. Today’s magazine publishers have made a dramatic shift to delivering eMedia-based content online as well as traditional print magazines. The development of PRISM 2.0 reflects the commitment of the PRISM Working Group to mirror today’s new publishing models in the specification.”

Dianne Kennedy, IDEAlliance Vice President of Publishing Technologies, comments, "Based on a series of focus groups conducted during 2006, we have undertaken an aggressive update of the PRISM Specification to address content that, for the first time, appears online before it is cast in print."

Visit the PRISM web site for more information, to sign up for a webinar, and to download a copy of the 2.0 specification.

What I want from Adobe - x-ray file formats

Consider The big content system integration diagram (draft 1).  What are the bottlenecks for the content stream?  Well if what I'm doing is any indicator, importing and exporting from InCopy or InDesign still has a little friction to it.  This is being addressed by Adobe progressively - see CS3 (yay!) for example over CS2.  Also being addressed by Softcare and other companies, again progressively.

But is there a fundamental change that can happen here?  Can we get XML that can pass through Adobe formats frictionlessly like Superman can see through walls with his x-ray vision?  (Digression here: There is a surprising amount of controversy on the Web about Superman's powers - many self proclaimed pundits seem to say that his X-Ray vision is unrealistic!  Not science that it is caused by low gravity on his home planet?  Baloney!)

Maybe the problem isn't that content can't yet be imported and exported seamlessly, maybe it is that content shouldn't be imported and exported at all.  If Adobe considered InDesign/InCopy not to be a holder of data, but an aggregator of data for print layout, then we might start getting somewhere.  Already, images are externally linked, why not, we might then ask, have text objects be externally linked?  I'm not talking about the page geometry, etc. that might live inside of an InCopy document.  I'm talking about just the text - and as XML if you please.

If Adobe therefore allowed XML content to remain external as files, and it allowed all external content, XML or images, to be linkable via HTTP protocol or otherwise, then we might have a situation where media and XML management systems could maintain content continually - without having to messily import during print production and export after print production.

Think of all the advantages!  Tremendous.  Metadata could be added externally, preparation for web could happen simultaneously. And Adobe page apps would still manage page layout and editing within page geometries so no skin off their nose.   Of course, this is not an easy thing to accomplish - but it is the logical future state - each format to its own system, with specialized apps consolidating, editing, and arranging the last mile of content.

PRISM plans to cover metadata for online content

The Publishing Requirements for Industry Standard Metadata (PRISM) group is planning to extend the PRISM standard to include metadata for online content (whether that content is "online-only" or online is one of its targets).  The group is making the discussion public, in the hope of hearing feedback from others.  You can find change logs on the discussion topics here.  I assume more updates and progress will be noted on the main PRISM site.

Automating topical classification

Imagine this scenario.  You've built (or bought) and maintain a comprehensive controlled vocabulary or taxonomy in a specific subject area.  That means that you have either spent time, effort and money employing a subject area expert to meticulously create a topical hierarchy that suits the needs of your content set, or you have spent money buying and customizing one of the many off-the-shelf taxonomies that best suit your knowledge area.

The problem is, you're not even half done.  Now you need to apply this topical metadata to your ever growing content set, and invest in tools that utilize this metadata for search, retrieval and export of your content.   Did I also neglect to mention that taxonomies need maintenance to adjust for changing events in the subject area?

It might be difficult to believe, but this is just the scenario that many publishers find themselves in after plunging head first into a taxonomy project without first considering the long term resource needs of such an effort.  For large taxonomies and content sets, the manual assignment of topical metadata is a significant and ongoing resource issue.  As such, many publishers are turning to automatic classification technology to automate their classification needs. 

There are two core choices for classification technology, rules-based and statistical matching.

Rules-based classification uses simple Boolean rules to assign topical metadata.  This is often a simplistic rule such as whether a series of words, chosen by an editor or matching on the title alone, are contained anywhere in the given piece of content.  This can sometimes be enhanced by creating additional rules about how frequently the term must appear in the content, normalized against the content size.
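
A minimal Python sketch of this kind of rule, under stated assumptions: the topics, the editor-chosen terms, and the per-100-words frequency floor are all invented for illustration and do not describe any particular classification product.

```python
import re

# Hypothetical rules: topic -> terms chosen by an editor. Not from a real taxonomy.
RULES = {
    "Abscisic Acid": ["abscisic acid"],
    "Plant Hormones": ["auxin", "gibberellin", "abscisic acid"],
}


def term_count(term, text):
    """Count whole-word, case-insensitive occurrences of a term in the text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def classify_by_rules(text, rules, min_per_100_words=0.0):
    """Assign a topic when any of its terms appears often enough.

    The hit count is normalized against content size (hits per 100 words),
    so a long document needs more mentions than a short one.
    """
    word_count = max(len(text.split()), 1)
    assigned = []
    for topic, terms in rules.items():
        hits = sum(term_count(term, text) for term in terms)
        rate = 100.0 * hits / word_count
        if hits > 0 and rate >= min_per_100_words:
            assigned.append(topic)
    return assigned


SAMPLE = "Abscisic acid regulates stomatal closure. Under drought, abscisic acid levels rise."

print(classify_by_rules(SAMPLE, RULES))
print(classify_by_rules(SAMPLE, RULES, min_per_100_words=50.0))
```

Raising the frequency floor trades false positives for false negatives, which is the rigidity the approach is known for.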

Statistical classification uses a vector of terms, chosen by an off-the-shelf technology component, that are statistically relevant based upon a series of training documents.  This has the advantage of providing a specific context to the matching criteria.  Some taxonomy providers provide their controlled vocabularies with the statistical matching rules already built-in.

Both statistical and rules-based classification engines use user defined thresholds as the final deciding factor about whether or not a piece of topical metadata will be assigned.  Some also use ISO 2788 thesauri to expand matching terms to synonyms, related, and broader/narrower terms.
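
The statistical approach and the threshold step can be sketched in a few lines of Python. This is a toy under loud assumptions: real engines use far more sophisticated term selection and weighting, whereas this sketch uses raw word counts and cosine similarity, and the training sentences, topic labels, and the 0.3 threshold are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn text into a bag-of-words vector of lowercase term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical training documents per topic.
TRAINING = {
    "Aaron (biblical figure)": [
        "Aaron was the brother of Moses and a priest of Israel",
        "Moses and Aaron spoke to Pharaoh in Egypt",
    ],
    "Hank Aaron (baseball)": [
        "Hank Aaron hit home runs for the Braves",
        "Aaron broke the baseball home run record",
    ],
}


def train(training):
    """Build one term vector per topic by summing its training documents."""
    profiles = {}
    for topic, documents in training.items():
        profile = Counter()
        for document in documents:
            profile.update(term_vector(document))
        profiles[topic] = profile
    return profiles


def classify(text, profiles, threshold=0.3):
    """Return (topic, score) pairs at or above the user-defined threshold, best first."""
    vector = term_vector(text)
    scored = [(topic, cosine(vector, profile)) for topic, profile in profiles.items()]
    kept = [(topic, score) for topic, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


profiles = train(TRAINING)
results = classify("Aaron hit another home run for the Braves", profiles)
loose_results = classify("Aaron hit another home run for the Braves", profiles, threshold=0.1)

print(results)
print(loose_results)
```

Lowering the threshold admits the weaker match as well, which is where an editor-assisted review step would come in.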

So which method is right for you? 

Rules-based classification is remarkably accurate for scientific, medical and pharmacological taxonomies that have very distinct matching terms.  Topic titles like "Abscisic Acid" or "Hydroxypalmitate" which are contained in content are very likely to be a good topical match.  The downsides to using rules-based classification are the time investment necessary to create the matching rules and the false negatives, or missed assignments, due to very rigid matching terms.

While imperfect, statistical classification is a better fit for general knowledge, historical, current events and financial taxonomies.  Topic titles like "Aaron" could refer to the biblical Aaron, Hank Aaron, Aaron Spelling or even Aaron's Bicycle Repair!   Statistical classification uses the context provided from the original training set to tailor results very specifically to your content set.  This technology is likely to produce false positives, but if tuned properly can be editor-assisted and effective.

Though research has been ongoing for decades in this area, the explosion of digital content and the growing power of CPUs has led to a renaissance of sorts in this technology area.  We can only expect the technology to get better with greater investment and wider use.

XHTML and RDFa

I've worked with a number of publishers who are quite content to store their content in XHTML (as opposed to some other flavor of XML).  And why not?  XHTML provides some good structure (XHTML2 even better) and is "ready to go" for web publishing, a main driver for many.  This is why the PRISM group chose XHTML as the format for the body of its PRISM Aggregator Message specification as opposed to a proprietary and brand new XML schema.

But of course XHTML may not provide everything you need, including semantic markup and metadata that can be easily accessible to others. You can create your own class attributes if you want, but that may just work for you.  So into the fray come things like microformats and RDFa.   The point of this post is to direct you over to Bob DuCharme's recent article about RDFa on XML.com because it is a great introduction to RDFa.  After opening with an overview of RDFa, the article points out three cases for its use:  inline metadata, metadata about the document itself, and metadata about components of the document.  All good stuff you'll want to do with metadata to add value to your content.

Will RDFa take off?  I don't know, but if you care about metadata and semantic markup and have your content in XHTML you'll certainly want to watch its progress. 

Metadata focus group invite

The PRISM Working Group of IDEAlliance is hosting a focus group on the knowledge and usage of industry standards and metadata.  Obviously there will be focus on the PRISM standard, but there will also be a wider discussion on how publishers are using metadata in their digital asset and content management workflows, as well as for enhancing content for reuse and product development.

The meeting is in New York City on the morning of September 21. Representatives from Time Inc., McGraw-Hill/Platts, and Dow Jones/Factiva will give brief presentations.

Some of the types of questions to be discussed include (these are from the PRISM group):

  • How are companies creating content agility and does it lead to increased monetization?
  • How do companies achieve sophisticated search today - or plan for it in the future?
  • Is there a way to easily share information with other business units, divisions, or outside companies?
  • What does an enterprise metadata strategy actually mean in a practical sense?  What is the value to companies who have one?
  • Do you manage your digital rights or do they manage you?
  • Are most companies achieving success with DAMs (Digital Asset Management) and content management systems and what are the critical factors?
  • Is automating RSS or other types of syndication feeds an imperative?

Invitations are open to publishers only and obviously to those who deal with metadata decisions within the organization.  If you are interested in attending, please contact me or Linda Burman, who is organizing the event for IDEAlliance.

If you cannot attend the meeting, please take a few minutes to fill out the PRISM survey on metadata use and knowledge of industry standards.

About this Blog

This blog is produced by the consultants and analysts from Really Strategies, a content solutions and services provider.
