January 11, 2006

Listed below are links to weblogs that reference Tags and taxonomies:

Comments

Michael Puscar

The debate essentially turns on two questions: what are the advantages of using a controlled vocabulary in tagging documents, and how does one control the cost of building, maintaining and classifying against these controlled vocabularies? I will briefly address both of these issues.

The word "taxonomy" has come to mean a variety of different things. Often it simply refers to an ISO 2788 thesaurus, with broader and narrower terms. While these thesauri are limited in utility, the advantages of using a true multi-level, hierarchical taxonomy are numerous.

With folksonomies, a topic associated with a document often does not provide enough context. What does it mean, for example, for a document to have "cancer" or even "gastrointestinal cancer" as a topical association? Is the document about gastrointestinal cancer screening and prevention? Phases in the cell cycle? The role of the immune response?

A true hierarchical taxonomy can provide a very granular context and make browsing and searching over large document sets much more efficient.
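To make the idea of "granular context" concrete, here is a minimal Python sketch. The topic names and the parent/child layout are invented for illustration and do not come from any real taxonomy product. It shows how a node in a hierarchy carries its whole path as context and how browsing a branch pulls in everything beneath it.

```python
# Hypothetical toy taxonomy: each topic maps to its parent (None marks a root).
PARENT = {
    "Medicine": None,
    "Oncology": "Medicine",
    "Gastrointestinal cancer": "Oncology",
    "Screening and prevention": "Gastrointestinal cancer",
    "Immune response": "Gastrointestinal cancer",
}


def path_to(topic):
    """Return the chain of topics from the root down to `topic`."""
    chain = []
    while topic is not None:
        chain.append(topic)
        topic = PARENT[topic]
    return list(reversed(chain))


def descendants(topic):
    """Return every topic that sits below `topic` in the hierarchy."""
    found = []
    for child, parent in PARENT.items():
        if parent == topic:
            found.append(child)
            found.extend(descendants(child))
    return found


# The full path tells a reader which sense of the topic is meant.
print(" > ".join(path_to("Screening and prevention")))
# Browsing "Oncology" reaches every narrower topic under it.
print(descendants("Oncology"))
```

A flat tag such as "cancer" has no equivalent of `path_to`, which is the gap described above.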

Second, controlled vocabularies make it possible to combine topical areas into queries that are much more efficient than simple coverage or Boolean searches. Take the following use case: the user would like to find all documents about anthrax in the context of biological warfare and not animal infection. This would not be possible with a folksonomy tagged simply with the word "anthrax".
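The anthrax use case can be sketched in a few lines of Python. Everything here is assumed for illustration: the document ids, the taxonomy paths, and the helper names `tagged`, `search` and `flat_search` are hypothetical, and a real search engine would use an index rather than a loop.

```python
# Hypothetical documents, each tagged with full taxonomy paths (root first).
DOCS = {
    "doc1": [("Security", "Biological warfare", "Anthrax")],
    "doc2": [("Veterinary medicine", "Animal infection", "Anthrax")],
    "doc3": [("Security", "Biological warfare", "Anthrax"),
             ("Veterinary medicine", "Animal infection", "Anthrax")],
    "doc4": [("Security", "Biological warfare", "Smallpox")],
}


def tagged(doc_paths, topic, context):
    """True if any path ends at `topic` and passes through `context`."""
    return any(p[-1] == topic and context in p[:-1] for p in doc_paths)


def search(topic, include, exclude):
    """Documents about `topic` under `include` and not under `exclude`."""
    return sorted(
        doc_id for doc_id, paths in DOCS.items()
        if tagged(paths, topic, include) and not tagged(paths, topic, exclude)
    )


def flat_search(tag):
    """What a bare tag lookup sees: only the last element of each path."""
    return sorted(
        doc_id for doc_id, paths in DOCS.items()
        if any(p[-1] == tag for p in paths)
    )


print(search("Anthrax", "Biological warfare", "Animal infection"))
print(flat_search("Anthrax"))
```

The flat lookup cannot tell doc1 from doc2, because the context that separates them lives in the path and not in the tag itself.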

The cost of creating and maintaining a taxonomy can be mitigated in many ways.

Many vendors, such as Fast, Verity, Inxight, Endeca and Convera, offer automatic classification of documents against a taxonomy using rule sets. The role of editors is therefore limited to simply reviewing and approving the results of the classification.
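As a rough picture of what classifying "against a taxonomy using rule sets" can look like, here is a deliberately simple Python sketch. The rule format (terms that must all appear, terms that must not appear) is an assumption made for this example; it is not the rule language of any of the vendors named above, whose products are far more elaborate.

```python
import re

# Hypothetical rule set: a topic matches when every "all" term appears
# in the text and no "none" term appears. Real rule languages differ.
RULES = {
    "Security > Biological warfare > Anthrax": {
        "all": {"anthrax", "weapon"},
        "none": {"livestock"},
    },
    "Veterinary medicine > Animal infection > Anthrax": {
        "all": {"anthrax", "livestock"},
        "none": set(),
    },
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text):
    """Return the taxonomy topics whose rules the text satisfies."""
    words = tokenize(text)
    return [
        topic for topic, rule in RULES.items()
        if rule["all"] <= words and not (rule["none"] & words)
    ]


print(classify("Anthrax spores prepared as a weapon"))
print(classify("An anthrax outbreak among livestock herds"))
```

An editor reviewing the output would then only need to confirm or reject the suggested topics for each document.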

Finally, many tools have been developed over the past five years to assist in the manual creation of taxonomies. However, if the cost of doing so is not appealing to an organization, one can now purchase taxonomies "out of the box" covering a comprehensive set of subject areas. An ever-growing number of vendors now provide pre-built taxonomies, including Wand, Intellisophic and Taxonomy Warehouse.

Marcia Morante

A few comments on mitigating the cost of creating and maintaining a taxonomy:

I agree that these are complex and resource-intensive tasks, and that it would be great to have a set of tools that would really help. But they're not out there yet.

Unfortunately, the vendors that you mention provide tools for bottom-up taxonomy building. They deal only with the "bags of words" available in each piece of content and do not take user or organizational needs into consideration. Also, they produce weird names for the taxonomic nodes.

Reviewing and approving the results are NOT simple tasks. They usually take the same effort, time and skill level associated with building a taxonomy from scratch. There are many related problems, and of course, these tools cannot be purchased independently.

The purchase of an "off the shelf" taxonomy is just as bad. I have never seen two organizations using identical taxonomies. All taxonomies are subjective, starting from the schemes of Melvil Dewey and before.

In addition, most companies have content in multiple subject domains - HR, Sales, IT, etc. The "off the shelf" taxonomies are limited to single subject domains. Although they might be a good starting point for some companies in some targeted areas, "off the shelf" solutions require extensive (and expensive) customization.

Michael Puscar

Marcia, I agree that bottom-up taxonomy building is not viable unless the need is very small. Most projects that I have seen run out of steam after only 300 or 400 topics, due either to the maintenance burden or to the cost of using a subject area expert.

I would also agree that the "bag of words" technology is not viable. This technology is called statistical clustering, and as you mentioned, the words that a machine chooses tend not to be the words that a human would choose. It is not a good solution.

I would suggest you look at a company named "Intellisophic". They have an impressive taxonomy library with millions of topics covering a deep range of subject areas. The taxonomies are generated as derivative products from vetted, published works. They have provided taxonomies with hundreds of thousands of nodes to customers in the pharmaceutical and government space with fantastic results.

"WAND" also provides reasonably good hand-built taxonomies, particularly in the retail catalog area.

Finally, though reviewing and editing are not zero-effort tasks, they certainly take less work than manual classification. And if you choose a good classification engine for your company, you may find that editorial review is not necessary at all. Many of the customers that I have worked with in the past have found this to be the case. The key is choosing a classification engine that suits your needs. Each has advantages and disadvantages.
