Gene Smith posts an interesting roundup of "the year in tags" on the You're It blog. I've tried to follow the "tagging" phenomenon, and it is useful to see the activities of the last year laid out like this, especially the acquisitions of del.icio.us and Flickr by Yahoo and the introduction of tag-like functionality by Amazon and Google. Obviously some big players are paying close attention to the use of unstructured tags by the masses.
From a traditional publishing perspective, it is interesting to think about the uncontrolled and arbitrary application of keywords by anybody and everybody versus the very controlled classification of content by subject area experts we are much more used to. How does tagging compare with professional taxonomy categorizations and very structured metadata? Is one superior to the other?
This discussion has been going on for a year now and, like most things, it is not an either-or choice. There is certainly room for, and value in, both. Assigning some tags to your photos and favorite web sites is obviously much different than applying a taxonomy to published medical research. But if freestyle tagging is as big as it seems to be, how will publishers take advantage of this trend? And should they?
For example, getting most authors to properly apply structured metadata has always been a challenge (or at least not a priority amongst the many things they need to do). Would authors be more receptive to tagging? Would this add any value to the content? Is it better than nothing? Does opening up content for users/customers to tag within your electronic products make sense for your product, market or audience? How do social tags mix in with more formal taxonomy classifications? All good questions I hope we see explored in greater depth in 2006.
The point of contention in the debate is essentially this: what are the advantages of using a controlled vocabulary in tagging documents, and how does one control the cost of building, maintaining and classifying against these controlled vocabularies? I will briefly address both of these issues.
The word "taxonomy" has come to mean a variety of different things. Often it simply refers to an ISO 2788 thesaurus, with broader and narrower terms. While these thesauri are limited in utility, the advantages of using a true multi-level, hierarchical taxonomy are numerous.
When using folksonomies, a topic that is associated with a document often does not provide enough context. What does it mean, for example, for a document to carry the word "cancer" or even "gastrointestinal cancer" as a topical association? Is the document about gastrointestinal cancer screening and prevention? Phases in the cell cycle? The role of the immune response?
A true hierarchical taxonomy can provide a very granular context and make browsing and searching over large document sets much more efficient.
Secondly, controlled vocabularies provide the ability to combine topical areas to produce queries that are much more efficient than simple coverage or Boolean searches. Take the following use case: a user would like to find all documents about anthrax in the context of biological warfare and not animal infection. This would not be possible with a folksonomy in which documents are tagged simply with the word "anthrax".
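To make the anthrax use case concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three documents, the taxonomy paths and the helper names by_tag and by_context are invented, and this is not how any particular product stores or queries a taxonomy. It only shows why a path through a hierarchy can answer a question that a flat tag cannot.

```python
# Hypothetical data: each document carries flat folksonomy tags and
# controlled-vocabulary paths (root-to-leaf tuples). All of it is invented.
DOCS = {
    "doc1": {
        "tags": {"anthrax"},
        "paths": [("Security", "Biological warfare", "Anthrax")],
    },
    "doc2": {
        "tags": {"anthrax"},
        "paths": [("Veterinary medicine", "Animal infection", "Anthrax")],
    },
    "doc3": {
        "tags": {"anthrax", "vaccine"},
        "paths": [
            ("Security", "Biological warfare", "Anthrax"),
            ("Veterinary medicine", "Animal infection", "Anthrax"),
        ],
    },
}


def by_tag(tag):
    """Flat folksonomy lookup: every document carrying the tag."""
    return sorted(doc_id for doc_id, meta in DOCS.items() if tag in meta["tags"])


def by_context(topic, within, excluding=None):
    """Documents where `topic` is a leaf under the `within` node,
    skipping any document that also has a path through `excluding`."""
    hits = []
    for doc_id, meta in DOCS.items():
        paths = meta["paths"]
        in_context = any(p[-1] == topic and within in p[:-1] for p in paths)
        excluded = excluding is not None and any(excluding in p for p in paths)
        if in_context and not excluded:
            hits.append(doc_id)
    return sorted(hits)


print(by_tag("anthrax"))
print(by_context("Anthrax", within="Biological warfare", excluding="Animal infection"))
```

With this sample data the flat tag lookup returns all three documents, while the context query returns only doc1, the one document classified under biological warfare and not under animal infection.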
The cost of creating and maintaining a taxonomy can be mitigated in many ways.
Many vendors, such as Fast, Verity, Inxight, Endeca and Convera, offer automatic classification of documents against a taxonomy using rule sets. The role of editors is therefore limited to simply reviewing and approving the results of the classification.
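As a rough illustration of what "classification against a taxonomy using rule sets" can mean, here is a small Python sketch. It is an assumption-laden toy, not the method of any vendor named above: the RULES table, its node names and the classify function are all hypothetical, and commercial engines use far richer rule languages.

```python
import re

# Hypothetical rule set: taxonomy node -> (terms that must all appear,
# terms that must not appear). Node names and terms are invented.
RULES = {
    "Security > Biological warfare > Anthrax": ({"anthrax", "weapon"}, {"livestock"}),
    "Veterinary medicine > Animal infection > Anthrax": ({"anthrax", "livestock"}, set()),
}


def classify(text):
    """Return the taxonomy nodes whose rule matches the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        node
        for node, (required, forbidden) in RULES.items()
        if required <= words and not (forbidden & words)
    )


print(classify("Anthrax spores used as a weapon"))
print(classify("Anthrax outbreak in livestock herds"))
```

In a workflow like the one described, the proposed nodes would go to an editor for review and approval rather than being assigned by hand from scratch.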
Finally, many tools have been developed over the past five years to assist in the manual creation of taxonomies. However, if the cost of doing so is not appealing to an organization, one can now purchase taxonomies "out of the box" covering a comprehensive set of subject areas. An ever-growing number of vendors now provide pre-built taxonomies, including Wand, Intellisophic and Taxonomy Warehouse.
Posted by: Michael Puscar | January 12, 2006 at 12:55 PM
A few comments regarding mitigating the cost of creating and maintaining a taxonomy:
I agree that these are complex and resource-intensive tasks, and that it would be great to have a set of tools that would really help. But they're not out there yet.
Unfortunately, the vendors that you mention provide tools for bottom-up taxonomy building. They deal only with the "bags of words" available in each piece of content and do not take user or organizational needs into consideration. Also, they produce weird names for the taxonomic nodes.
Reviewing and approving the results are NOT simple tasks. They usually take the same effort, time and skill level associated with building a taxonomy from scratch. There are many related problems, and of course, these tools cannot be purchased independently.
The purchase of an "off the shelf" taxonomy is just as bad. I have never seen two organizations using identical taxonomies. All taxonomies are subjective, starting from the schemes of Melville Dewey and before.
In addition, most companies have content in multiple subject domains - HR, Sales, IT, etc. The "off the shelf" taxonomies are limited to single subject domains. Although they might be a good starting point for some companies in some targeted areas, "off the shelf" solutions require extensive (and expensive) customization.
Posted by: Marcia Morante | January 16, 2006 at 01:13 PM
Marcia, I agree that bottom-up taxonomy building is not viable unless the need is very small. Most projects that I have seen run out of steam after only 300 or 400 topics, either due to maintenance or the cost of using a subject area expert.
I would also agree that the "bag of words" technology is not viable. This technology is called statistical clustering, and as you mentioned, the words that a machine chooses tend not to be the words that a human would choose. It is not a good solution.
I would suggest you look at a company named "Intellisophic". They have an impressive taxonomy library with millions of topics covering a deep range of subject areas. The taxonomies are generated as derivative products from vetted, published works. They have provided taxonomies with hundreds of thousands of nodes to customers in the pharmaceutical and government space with fantastic results.
"WAND" also provides reasonably good hand-built taxonomies, particularly in the retail catalog area.
Finally, though reviewing and editing are not zero-effort tasks, they certainly take less work than manual classification. And if you choose a good classification engine for your company, you may just find that editorial review is not necessary at all. Many of the customers that I have worked with in the past have found this to be the case. The key is choosing a classification engine that suits your needs. Each has advantages and disadvantages.
Posted by: Michael Puscar | January 17, 2006 at 09:59 AM