Imagine this scenario. You've built (or bought) and maintain a comprehensive controlled vocabulary or taxonomy in a specific subject area. That means that you have either spent time, effort and money employing a subject area expert to meticulously create a topical hierarchy that suits the needs of your content set, or you have spent money buying and customizing one of the many off-the-shelf taxonomies that best suit your knowledge area.
The problem is, you're not even half done. Now you need to apply this topical metadata to your ever-growing content set and invest in tools that use this metadata for search, retrieval and export of your content. Did I mention that taxonomies also need maintenance to keep up with changes in the subject area?
It might be difficult to believe, but this is exactly the scenario many publishers find themselves in after plunging headfirst into a taxonomy project without first considering the long-term resource needs of such an effort. For large taxonomies and content sets, the manual assignment of topical metadata is a significant and ongoing resource issue. As a result, many publishers are turning to automatic classification technology to meet their classification needs.
There are two core choices for classification technology: rules-based matching and statistical matching.
Rules-based classification uses simple Boolean rules to assign topical metadata. Often the rule is as simple as whether a series of words, either chosen by an editor or taken from the topic title alone, appears anywhere in the given piece of content. This can sometimes be enhanced with additional rules about how frequently the term must appear in the content, normalized against the content size.
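As a rough illustration of such a rule, here is a minimal Python sketch. The function names and the "hits per 1,000 words" normalization are assumptions made for this example, not the behavior of any particular product.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_phrase(tokens, phrase_tokens):
    """Count occurrences of a (possibly multi-word) phrase in a token list."""
    n = len(phrase_tokens)
    if n == 0:
        return 0
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens
    )


def rule_matches(text, terms, min_per_1000_words=0.0):
    """Hypothetical Boolean rule: True if any matching term appears in the
    text at least `min_per_1000_words` times per 1,000 words."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(count_phrase(tokens, tokenize(term)) for term in terms)
    if hits == 0:
        return False
    # Normalize the raw hit count against the content size.
    density = hits * 1000 / len(tokens)
    return density >= min_per_1000_words


text = ("Abscisic acid regulates seed dormancy. "
        "Abscisic acid levels rise under drought.")
print(rule_matches(text, ["Abscisic Acid"]))
print(rule_matches("Nothing relevant here.", ["Abscisic Acid"]))
```

With the default minimum of zero, the rule reduces to the plain "is the term contained anywhere" check; raising the minimum makes it stricter for long documents that mention a term only in passing.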
Statistical classification uses a vector of terms, chosen by an off-the-shelf technology component, that are statistically relevant based on a series of training documents. This has the advantage of giving the matching criteria a specific context. Some taxonomy providers supply their controlled vocabularies with the statistical matching rules already built in.
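To make the idea of a term vector concrete, here is a deliberately simplified Python sketch. It builds each topic's vector from raw term counts in its training documents and scores new content by cosine similarity. Commercial components use more sophisticated term weighting, so treat the technique, the function names and the sample topics as assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_topic_vector(training_docs):
    """Build a term vector for one topic by summing term counts
    across that topic's training documents."""
    vector = Counter()
    for doc in training_docs:
        vector.update(tokenize(doc))
    return vector


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_topics(text, topic_vectors):
    """Score a piece of content against every trained topic vector."""
    doc_vector = Counter(tokenize(text))
    return {
        topic: cosine_similarity(doc_vector, vector)
        for topic, vector in topic_vectors.items()
    }


# Hypothetical topics and training documents, invented for this example.
topic_vectors = {
    "Aaron, Hank": train_topic_vector([
        "Hank Aaron hit home runs for the Braves",
        "The Braves won the baseball game with a home run",
    ]),
    "Aaron (biblical)": train_topic_vector([
        "Aaron was the brother of Moses in the Book of Exodus",
        "Moses and Aaron spoke to Pharaoh",
    ]),
}
scores = score_topics("Aaron broke the home run record in baseball", topic_vectors)
print(max(scores, key=scores.get))
```

The surrounding words ("home", "run", "baseball") are what pull the score toward one topic, which is the context that a single matching term cannot provide.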
Both statistical and rules-based classification engines use user-defined thresholds as the final deciding factor about whether or not a piece of topical metadata will be assigned. Some also use ISO 2788 thesauri to expand matching terms to synonyms, related terms, and broader/narrower terms.
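The two ideas in this paragraph can be sketched in a few lines of Python. The thesaurus layout (a dictionary keyed by term, with one list per relationship type) and the sample entries are assumptions for illustration; an actual ISO 2788 thesaurus would be loaded from whatever format your vendor supplies.

```python
def expand_terms(term, thesaurus,
                 relations=("synonyms", "related", "broader", "narrower")):
    """Expand a matching term using a thesaurus.

    `thesaurus` is a hypothetical dict mapping a lowercase term to a dict
    of relationship lists; unknown terms expand only to themselves.
    """
    expanded = {term.lower()}
    entry = thesaurus.get(term.lower(), {})
    for relation in relations:
        expanded.update(t.lower() for t in entry.get(relation, []))
    return expanded


def assign_topics(scores, threshold):
    """Apply a user-defined threshold: keep only topics whose score
    meets or exceeds it."""
    return sorted(topic for topic, score in scores.items() if score >= threshold)


# Illustrative thesaurus entry, invented for this example.
thesaurus = {
    "abscisic acid": {
        "synonyms": ["ABA", "dormin"],
        "broader": ["plant hormones"],
    }
}
print(sorted(expand_terms("Abscisic Acid", thesaurus)))
print(assign_topics({"Plant Hormones": 0.82, "Soil Science": 0.31}, threshold=0.5))
```

Lowering the threshold assigns more topics at the cost of more false positives; raising it does the opposite, so the value is usually tuned against a sample of editor-reviewed content.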
So which method is right for you?
Rules-based classification is remarkably accurate for scientific, medical and pharmacological taxonomies that have very distinct matching terms. A topic title like "Abscisic Acid" or "Hydroxypalmitate" that appears in a piece of content is very likely to be a good topical match. The downsides of rules-based classification are the time investment needed to create the matching rules and the false negatives, or missed assignments, caused by very rigid matching terms.
While imperfect, statistical classification is a better fit for general knowledge, historical, current events and financial taxonomies. A topic title like "Aaron" could refer to the biblical Aaron, Hank Aaron, Aaron Spelling or even Aaron's Bicycle Repair! Statistical classification uses the context provided by the original training set to tailor results very specifically to your content set. This technology is likely to produce false positives, but if tuned properly and paired with editor review it can be effective.
Though research in this area has been ongoing for decades, the explosion of digital content and the growing power of CPUs have led to a renaissance of sorts in this technology. We can only expect the technology to get better with greater investment and wider use.