"Taxonomies are dead. Long live metadata!"

This is number 3 in CMS Watch's Technology Predictions for 2009.

"With social computing coming to the fore, it's never been more obvious that everyone does not, and will never, categorize things in the same way. It doesn't even matter what's correct anymore (well, it does to me, but I'm not about to spend my days stopping people from tagging a map of Botswana with the word "Ohio.") While I'll never agree with David Weinberger's assertion that "everything is miscellaneous" (a taxonomist's least-favorite word), I will assert that the days of the traditional, definitive, and single-hierarchy taxonomy are long behind us.

Enter the varied and multi-faceted application of metadata, experienced as people would like to experience it. In the search world, Endeca popularized it, now it's a commodity. You should be able to get to information the way you want, which may be different from your colleague's approach. We still need controlled vocabularies. We still need to tag content. Text mining and auto-tagging software is gradually improving, and extracted terms can be applied as metadata. But that metadata needs to be a lot more fluid, cloud-like, and by no means fixed in a single hierarchy. And even if it doesn't make sense to you that that map of Botswana is tagged with the word "Ohio" -- it probably makes perfect sense to someone. One person's chaos is another person's perfect path to findability."

Opinions?

Live DITA Application: FASB U.S. GAAP Codification

The work of all accountants doing commercial accounting in the U.S. is governed by the Generally Accepted Accounting Principles (GAAP), created and maintained by the Financial Accounting Standards Board, a member-supported organization mandated by the U.S. Congress.

Historically the GAAP has been created as a mishmash of different documents and supporting interpretation and commentary. There was no single organizing schema or source. In short, it was essentially impossible to determine whether or not you had found everything relevant to a given accounting issue.

To address this problem, the FASB decided to create a new all-encompassing classification taxonomy for the GAAP and codify all existing GAAP standards under this taxonomy. This project has been going on for over four years and has resulted in the Accounting Standards Codification, or ASC. The ASC content is currently undergoing an extended period of public review and is available through the FASB ASC Web site: http://asc.fasb.org/home.

While the ASC taxonomy itself was a major achievement, the codification activity was a daunting editorial process in which all the existing standards content had to be re-authored in a new form that directly reflects the taxonomy. To support this activity the FASB decided to use an XML-based system, which should come as no surprise.

But beyond that, the FASB realized several important things:

  • The GAAP content is highly modular
  • The GAAP content can be organized in many different useful ways depending on how it is being used:
    • By subject
    • By industry
    • By business process
    • By what's of immediate interest to a particular person researching a problem or set of problems.
  • The GAAP content requires rich metadata to enable accurate search and retrieval as well as binding to the new ASC taxonomy
  • Licensees of the content will want the XML source and will want to be able to use it with as little effort and expense as possible
  • The FASB does not have huge budgets for XML application development and implementation yet needs non-trivial systems for authoring and managing the GAAP content through its editorial processes as well as for delivery through the authoritative FASB Web site.

Given the foregoing, the FASB realized that a more traditional XML application, while possible, would not necessarily be optimal, would likely be prohibitively expensive, and would not meet licensees' requirements for ease of use of the XML content.

However, a DITA-based application would satisfy all these requirements. David Prather at FASB realized that the GAAP content could be modeled quite handily using DITA with some GAAP-specific specializations.

David worked out a clever way to use DITA maps to manage the organization and packaging of the codified GAAP content and hired me to design and implement the necessary GAAP-specific specializations (as well as to convert the data from the XML format they had used for the initial codification editorial work). The FASB selected Ovitas to implement a new editorial support CMS as well as the dynamic delivery system used to serve the ASC content through the FASB Web site.

The project went remarkably quickly--we had working DITA specializations defined and in place in a matter of weeks and the models required only minor refinement as the system implementation progressed, mostly stemming from new understandings of the underlying content as the codification editorial process approached completion. The CMS and Web site implementation went equally smoothly (remarkably so in my experience building such systems).

Because we could use the free DITA Open Toolkit to generate HTML sufficient for internal review of the codified content, we didn't need to invest any time or money in acquiring or building rendering support just for internal QA of the DITA content, a significant savings. Essentially, it allowed one part-time consultant, me, to do what would in the past have required a team of three or four consultants and months of work. By the same token, we were able to use the off-the-shelf DITA support in XML editors like Arbortext Editor and OxygenXML, removing the need to invest in document-type-specific editor configurations and customizations, again saving weeks or months of consultant time. I think I spent about two days coming up to speed on how to configure Arbortext Editor to work with specialized DITA document types and about half a day creating the necessary configurations (it's essentially a copy-and-modify process that I can now do in minutes).

Likewise, the Toolkit means that licensees can do *something* with the ASC content immediately, as well as giving them a solid base from which to develop whatever internal processes they need. Large publishers with existing XML infrastructure can of course apply that, but smaller publishers with little or no XML infrastructure can still take immediate advantage of the ASC XML source.

The content on the FASB ASC Web site is served dynamically from a slightly sanitized version of the DITA source--it is not static HTML pages generated from the DITA source.

The FASB ASC application is a working example of how the unique features of DITA XML applications significantly lower the cost of building this type of system while enabling significant value for the DITA-based content itself.

One interesting side effect of this system is that most, if not all, of the FASB's licensees, which include all the big name publishers and many smaller ones, will end up with both DITA-supporting internal systems as well as internal DITA expertise that can then be quickly and easily applied to any other DITA-based content, regardless of its markup details or subject domain. That seems pretty interesting to me....

Automating topical classification

Imagine this scenario.  You've built (or bought) and maintain a comprehensive controlled vocabulary or taxonomy in a specific subject area.  That means that you have either spent time, effort and money employing a subject area expert to meticulously create a topical hierarchy that suits the needs of your content set, or you have spent money buying and customizing one of the many off-the-shelf taxonomies that best suit your knowledge area.

The problem is, you're not even half done.  Now you need to apply this topical metadata to your ever growing content set, and invest in tools that utilize this metadata for search, retrieval and export of your content.   Did I also neglect to mention that taxonomies need maintenance to adjust for changing events in the subject area?

It might be difficult to believe, but this is just the scenario that many publishers find themselves in after plunging headfirst into a taxonomy project without first considering the long-term resource needs of such an effort.  For large taxonomies and content sets, the manual assignment of topical metadata is a significant and ongoing resource issue.  As such, many publishers are turning to automatic classification technology to handle their classification needs.

There are two core choices for classification technology: rules-based and statistical matching.

Rules-based classification uses simple Boolean rules to assign topical metadata.  Often the rule is as simple as checking whether a series of words, either chosen by an editor or taken from the topic title alone, appears anywhere in the given piece of content.  This can sometimes be enhanced by adding rules about how frequently the term must appear in the content, normalized against the content size.
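
To make the idea concrete, here is a minimal sketch in Python of a rules-based classifier. It is an illustration under stated assumptions, not any vendor's implementation: the `Rule` structure, the "any listed term matches" logic, and the per-1,000-words frequency threshold are all hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical Boolean rule: assign `topic` when its terms match."""
    topic: str
    terms: list[str]            # words or phrases chosen by an editor
    min_per_1000: float = 0.0   # minimum matches per 1,000 words of content


def count_term(term: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of a term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def classify(text: str, rules: list[Rule]) -> list[str]:
    """Return the topics whose rules fire for this text."""
    n_words = max(len(text.split()), 1)
    topics = []
    for rule in rules:
        hits = sum(count_term(term, text) for term in rule.terms)
        # Normalize the match count against the size of the content.
        density = hits * 1000 / n_words
        if hits > 0 and density >= rule.min_per_1000:
            topics.append(rule.topic)
    return topics


rules = [
    Rule("Abscisic Acid", ["abscisic acid", "ABA"]),
    Rule("Hydroxypalmitate", ["hydroxypalmitate"], min_per_1000=5.0),
]
doc = "Abscisic acid regulates stomatal closure. Hydroxypalmitate was not measured."
print(classify(doc, rules))
```

In a short document a single mention is enough to clear the frequency threshold, while the same single mention in a 1,000-word document would not be. That is the kind of rigidity that produces missed assignments.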

Statistical classification uses a vector of terms for each topic.  The terms are chosen by an off-the-shelf technology component because they are statistically relevant in a set of training documents.  This has the advantage of providing a specific context to the matching criteria.  Some taxonomy providers supply their controlled vocabularies with the statistical matching rules already built in.
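
As a rough illustration of the term-vector idea, the sketch below builds one term-frequency vector per topic from a handful of training documents and scores new content by cosine similarity. This is a deliberately simple stand-in, not what commercial classification engines actually do, and the training sentences are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples: dict[str, list[str]]) -> dict[str, Counter]:
    """Build one term-frequency vector per topic from its training documents."""
    return {
        topic: Counter(tok for doc in docs for tok in tokenize(doc))
        for topic, docs in examples.items()
    }


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(text: str, model: dict[str, Counter]) -> dict[str, float]:
    """Score a piece of content against every topic vector."""
    vec = Counter(tokenize(text))
    return {topic: cosine(vec, topic_vec) for topic, topic_vec in model.items()}


# Invented training sentences, for illustration only.
training = {
    "Hank Aaron": [
        "Aaron hit another home run for the Braves",
        "The Braves outfielder Aaron broke the home run record",
    ],
    "Aaron Spelling": [
        "Aaron produced the television series",
        "The television producer Aaron launched a new series",
    ],
}
model = train(training)
scores = score("Aaron hit a home run last night", model)
best = max(scores, key=scores.get)
print(best, scores)
```

The word "Aaron" alone cannot separate the two topics, but the surrounding terms learned from the training documents can. That is the context a title-matching rule lacks.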

Both statistical and rules-based classification engines use user-defined thresholds as the final deciding factor about whether or not a piece of topical metadata will be assigned.  Some also use ISO 2788 thesauri to expand matching terms to synonyms, related, and broader/narrower terms.
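
The two mechanisms just described, thresholds and thesaurus expansion, can be sketched as follows. The `Entry` structure, the function names, and the sample thesaurus entry are assumptions made for illustration; they borrow the usual UF/BT/NT/RT relationship types but do not represent any real product or a complete ISO 2788 model.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One thesaurus entry with ISO 2788-style relationships (hypothetical)."""
    used_for: list[str] = field(default_factory=list)  # UF: synonyms
    broader: list[str] = field(default_factory=list)   # BT: broader terms
    narrower: list[str] = field(default_factory=list)  # NT: narrower terms
    related: list[str] = field(default_factory=list)   # RT: related terms


def expand(term: str, thesaurus: dict[str, Entry],
           include: tuple[str, ...] = ("used_for",)) -> list[str]:
    """Return the term plus the thesaurus terms for the chosen relationships."""
    terms = [term]
    entry = thesaurus.get(term)
    if entry is not None:
        for relationship in include:
            terms.extend(getattr(entry, relationship))
    return terms


def assign(scores: dict[str, float], threshold: float) -> list[str]:
    """Keep only the topics whose score meets the user-defined threshold."""
    return sorted(topic for topic, s in scores.items() if s >= threshold)


# Illustrative entry only, not taken from a published thesaurus.
thesaurus = {
    "Abscisic Acid": Entry(used_for=["ABA", "dormin"],
                           broader=["Plant Growth Regulators"]),
}
print(expand("Abscisic Acid", thesaurus))
print(assign({"Abscisic Acid": 0.82, "Hydroxypalmitate": 0.31}, threshold=0.5))
```

Raising the threshold trades missed assignments for fewer false positives, and expanding to broader or related terms pushes in the opposite direction.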

So which method is right for you? 

Rules-based classification is remarkably accurate for scientific, medical and pharmacological taxonomies that have very distinct matching terms.  Topic titles like "Abscisic Acid" or "Hydroxypalmitate" which are contained in content are very likely to be a good topical match.  The downsides to using rules-based classification are the time investment necessary to create the matching rules and the false negatives, or missed assignments, due to very rigid matching terms.

While imperfect, statistical classification is a better fit for general knowledge, historical, current events and financial taxonomies.  Topic titles like "Aaron" could refer to the biblical Aaron, Hank Aaron, Aaron Spelling or even Aaron's Bicycle Repair!   Statistical classification uses the context provided from the original training set to tailor results very specifically to your content set.  This technology is likely to produce false positives, but if tuned properly can be editor-assisted and effective.

Though research has been ongoing for decades in this area, the explosion of digital content and the growing power of CPUs has led to a renaissance of sorts in this technology area.  We can only expect the technology to get better with greater investment and wider use.

Using rivals to Inform readers

Thanks to Chet Ensign for forwarding a link to this story about newspapers linking to rival sites. This could easily apply to other online publishers as well.

The story describes how Inform's publisher services product uses topical metadata to create links to related online content. (Be sure to view the animated brochure at the bottom of the page.) If you're a publisher using the service, the Inform software finds topics in your site's content and creates inline links or sidebar link lists for you to include. It also provides premium services for things like search term disambiguation. Pretty cool technology, and pretty interesting business model for those publishers willing to set aside traditional concerns about linking to rival sites in exchange for the (to my mind) more important goal of satisfying their readers' desires for relevant information.

Inform also has a free news service that provides a great illustration of applied topical metadata, especially for general news publishers. Just go to their home page and click on the related topics links to get a glimpse inside their categorization scheme (topics, industries, people, places, companies [organizations], products). They also use standard news categories for their primary navigation, plus some additional subcategories that I'd guess map to their topical categories. Site users can customize their access to the news by subscribing to online sources, topics, and RSS feeds of their choice. Good stuff for the reader, but mostly a nice illustration to the publishers Inform wants to serve of the kinds of things they could do on their own sites.

Side note - Reading stories about Inform made me wonder about the complicated web of relationships forming among newspapers, AP, Google, and services like Inform. Definitely some overlapping territory here. And by searching for AP and Google on Inform's site, I found this story.

Taxonomize

I was at a meeting with a publisher client last week discussing metadata and the head of web development used the word taxonomize as in "then we taxonomize the content before publishing it to the web."

I've worked on a number of projects involving metadata and taxonomies but never came across this word, which turns the noun taxonomy into a verb. I've always needed to use phrases with other verbs, like apply or assign (the taxonomy), or use a synonym like classify.

It reminded me of some of the -ize words I hear in this industry but don't particularly like, such as monetize or productize.   But there are a number of other -ize words I don't mind or have come to accept, like digitize.

I am curious to see if taxonomize is used at the Taxonomy Boot Camp this coming November, but I can't find it anywhere on the site.

After the meeting, I realized taxonomize fit naturally into the conversation and avoided longer worded descriptions for the same action. So, I like it, or at least I find it useful.  And I knew I wanted to get back and blogize it.

Taxonomy Boot Camp 2006

I've been asked to speak at the Taxonomy Boot Camp 2006 in San Jose, CA in early November.  My presentation is on "Defining your strategy" and will look at developing an overall strategy for metadata, including taxonomies.  The conference itself is obviously focused on taxonomy work and goes beyond publishing to include a focus on enterprise taxonomies.  But I see Roger Sperberg, who is working with Wolters Kluwer, and Susan Saraidaridis—who I don't know but has the cool title "Enterprise Taxonomist & Metadata Manager"—from Harvard Business School Publishing are also presenting.  The line up and topics overall look really interesting and should be worth it for anyone working or planning to work with taxonomies.

Writers write and categorizers categorize

In today's IT World, Sean McGrath offers up a good piece of advice regarding where in the workflow to assign classification or taxonomical metadata to content.

He says...

...many metadata based content management systems have trouble getting good metadata out of content creators. The 'aboutness' of the stuff that was used to design the content management system was obvious because it was created after the content itself. However, for new content, the 'aboutness' has yet to be cooked so to speak....Writers write and categorizers categorize. There is an unavoidable delay between the two activities. The writers and the categorizers can be the same people but the activities are very different and cannot be done at the same time.

I am not sure my feeling on the topic is as concrete as Sean's.  I do think there are some cases in which the content development and categorization can - or needs to be - done at the same time. It is one of those "there is no right answer" things that depends on circumstances and the people and processes involved. But I do agree generally and I will say I've seen more of what Sean recommends than the other way around.   And I think it is especially true for publishers who are doing some additional processing for web publishing.

But more importantly, I've often found that even with those who do separate out these activities, there is always this feeling that it would be "better" to move the categorization and classification work more towards the point of content creation (the "why can't we get our authors to tag the content" question).  What I really like about Sean's point is that it questions that assumption.  It is not necessarily better to do it that way.  In fact, there are very good reasons to separate the two activities and the division can indeed be the "better" process.

Tags and taxonomies

Gene Smith posts an interesting roundup of "the year in tags" on the You're It blog.  I've tried to follow the "tagging" phenomenon and it is interesting to see the activities of the last year laid out like this, especially the acquisitions of del.icio.us and Flickr by Yahoo and the introduction of tag-like functionality by Amazon and Google.  Obviously some big players are paying close attention to the use of unstructured tags by the masses.

From a traditional publishing perspective, it is interesting to think about the uncontrolled and arbitrary application of keywords by anybody and everybody versus the very controlled classification of content by subject area experts we are much more used to.  How does tagging compare with professional taxonomy categorizations and very structured metadata?  Is one superior to the other?

This discussion has been going on for a year now and, like most things, it is not an either-or choice.  There is certainly room for, and value in, both.  Assigning some tags to your photos and favorite web sites is obviously much different from applying a taxonomy to published medical research.  But if freestyle tagging is as big as it seems to be, how will publishers take advantage of this trend?  And should they?

For example, getting most authors to properly apply structured metadata has always been a challenge (or at least not a priority amongst the many things they need to do).  Would authors be more receptive to tagging?  Would this add any value to the content?  Is it better than nothing?  Does opening up content for users/customers to tag within your electronic products make sense for your product, market or audience?  How do social tags mix in with more formal taxonomy classifications?  All good questions I hope we see explored in greater depth in 2006.

Microsoft and OCA, content formats for digital libraries

If you haven't heard, Microsoft has joined with the Open Content Alliance (OCA). The OCA is the creation of Yahoo and the nonprofit Internet Archive, and Microsoft's participation is drawing more attention to how the OCA's efforts to digitize content contrast with Google's. This week CNET published an article that nicely articulates the differences in approach. The OCA has taken a less contentious route than Google by limiting their effort to those works that are in the public domain, unless the copyright owner has given permission that they be included. (While you're reading the CNET story, check out the cool "Big Picture" feature on their site. It provides a graphical means to navigate CNET content by several different subject areas. Think "Semantic Web".)

I haven't found much information about the digital content format to be used by Google or OCA. Adobe is part of OCA, and at least some OCA content will be stored as PDF (with fulltext in the background for search?). OCA will also be gathering multimedia content as well as print sources. If you go to the Google Print site, you'll see that book pages appear to be captured as images. It would be especially interesting to know what kind of metadata is (and isn't) being captured by both groups. For example, it seems obvious that the Google system doesn't "know" when two books are actually the same classic (say, Shakespeare's Macbeth) published by different publishers. (And I'm not suggesting it makes business sense for Google to capture this information.) In fact, if you search for "Shakespeare Macbeth" on the site, you get more than 32,000 results, and the versions of the play Macbeth don't all float to the top - a book on Kurosawa is in position 5 (today). While to a particular reader this might not matter too much, it certainly will be of interest to the publisher. (Simon and Schuster's Macbeth is in position #1, but Kessinger's doesn't show up until the 3rd page of results.) Can the publisher influence this in any way? Should they be able to?

About this Blog

This blog is produced by the consultants and analysts from Really Strategies, a content solutions and services provider.
