What is the difference between RSuite CMS and MarkLogic Server?

Publishers often ask us, “What’s the difference between RSuite CMS and MarkLogic Server?” Great question! The most straightforward answer is that RSuite is a content management application and MarkLogic Server is a database. It’s that simple.

MarkLogic Server is an incredibly powerful XML repository on top of which many publishers, including O’Reilly (Safari U), Elsevier, and Congressional Quarterly, have built fantastic applications. These custom-built delivery applications are just one way MarkLogic Server has been used.

RSuite CMS is also an application built on MarkLogic Server. RSuite CMS sits on top of MarkLogic Server to leverage its native XML repository (i.e., database) and search capabilities. Without a database, RSuite would not be able to run, just as a car cannot run without an engine. However, without a chassis, steering wheel, electronics, etc., the engine would be of little use. Therefore, think of RSuite as the ignition system and think of MarkLogic Server as the engine. Both are very important to content management.

Automating topical classification

Imagine this scenario.  You've built (or bought) and maintain a comprehensive controlled vocabulary or taxonomy in a specific subject area.  That means that you have either spent time, effort and money employing a subject area expert to meticulously create a topical hierarchy that suits the needs of your content set, or you have spent money buying and customizing one of the many off-the-shelf taxonomies that best suit your knowledge area.

The problem is, you're not even half done.  Now you need to apply this topical metadata to your ever-growing content set, and invest in tools that use this metadata for search, retrieval and export of your content.  Did I also neglect to mention that taxonomies need maintenance to adjust for changing events in the subject area?

It might be difficult to believe, but this is just the scenario that many publishers find themselves in after plunging headfirst into a taxonomy project without first considering the long-term resource needs of such an effort.  For large taxonomies and content sets, the manual assignment of topical metadata is a significant and ongoing resource issue.  As a result, many publishers are turning to automatic classification technology.

There are two core choices for classification technology: rules-based matching and statistical matching.

Rules-based classification uses simple Boolean rules to assign topical metadata.  This is often a simplistic rule such as whether a series of words, chosen by an editor or matching on the title alone, are contained anywhere in the given piece of content.  This can sometimes be enhanced by creating additional rules about how frequently the term must appear in the content, normalized against the content size.
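To make that concrete, here is a minimal sketch in Python of a Boolean rule with an optional frequency check. The function names, the simple tokenizer and the min_density parameter are my own illustrative assumptions, not any vendor's API, and a real engine would also handle phrases and word forms.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rule_matches(text, terms, min_density=0.0):
    """Boolean rule: every term must appear somewhere in the text.

    Optionally, the combined count of the terms divided by the total
    number of tokens (a size-normalized frequency) must reach
    min_density. Terms are expected to be single lowercase words.
    """
    tokens = tokenize(text)
    if not tokens:
        return False
    counts = {term: tokens.count(term) for term in terms}
    if any(count == 0 for count in counts.values()):
        return False
    density = sum(counts.values()) / len(tokens)
    return density >= min_density


sample = ("Abscisic acid regulates stomatal closure. "
          "The acid accumulates under drought.")
has_topic = rule_matches(sample, ["abscisic", "acid"], min_density=0.2)
```

Raising min_density makes the rule stricter, which trades false positives for missed assignments.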

Statistical classification uses a vector of terms, chosen by an off-the-shelf technology component, that are statistically relevant to a topic based on a series of training documents.  This has the advantage of providing a specific context to the matching criteria.  Some taxonomy providers supply their controlled vocabularies with the statistical matching rules already built in.
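As a rough illustration of the idea, the sketch below builds a term vector for each topic from a handful of training documents and scores new content by cosine similarity. This is an assumption on my part about how such a component might work in its simplest form; commercial engines use more sophisticated weighting and term selection, and the training sentences here are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_topic_vector(training_docs):
    """Sum raw term counts across the training documents for one topic."""
    vector = Counter()
    for doc in training_docs:
        vector.update(tokenize(doc))
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score(text, topic_vector):
    """Score a piece of content against a trained topic vector."""
    return cosine_similarity(Counter(tokenize(text)), topic_vector)


# Made-up training sets for two hypothetical topics.
baseball = train_topic_vector([
    "Hank Aaron hit 755 home runs",
    "The batter hit a home run in the ninth inning",
])
bible = train_topic_vector([
    "Aaron was the brother of Moses",
    "Moses and Aaron spoke to Pharaoh",
])

doc = "Aaron hit a home run"
baseball_score = score(doc, baseball)
bible_score = score(doc, bible)
```

Here the surrounding words, not the name alone, pull the document toward the baseball topic, which is the "context" that training documents provide.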

Both statistical and rules-based classification engines use user-defined thresholds as the final deciding factor about whether or not a piece of topical metadata will be assigned.  Some also use ISO 2788 thesauri to expand matching terms to synonyms, related terms, and broader/narrower terms.
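Both ideas are easy to sketch in Python. In the example below, the scores, the threshold values and the thesaurus entry are invented for illustration, and the function names are hypothetical. The relation labels UF (used for), BT (broader term), NT (narrower term) and RT (related term) follow the usual thesaurus abbreviations.

```python
def expand_terms(term, thesaurus):
    """Return the term plus its UF, BT, NT and RT entries, if any."""
    entry = thesaurus.get(term, {})
    expanded = {term}
    for relation in ("UF", "BT", "NT", "RT"):
        expanded.update(entry.get(relation, []))
    return expanded


def assign_topics(scores, default_threshold=0.5, overrides=None):
    """Keep only the topics whose score meets the user-defined threshold.

    overrides lets an editor set a different threshold for a single topic.
    """
    overrides = overrides or {}
    return sorted(
        topic
        for topic, value in scores.items()
        if value >= overrides.get(topic, default_threshold)
    )


# Illustrative data only.
thesaurus = {"abscisic acid": {"UF": ["ABA"], "BT": ["plant hormones"]}}
scores = {"Baseball": 0.68, "Bible": 0.22, "Television": 0.41}

matching_terms = expand_terms("abscisic acid", thesaurus)
assigned = assign_topics(scores)
assigned_loose = assign_topics(scores, overrides={"Television": 0.4})
```

Lowering a threshold for one topic, as in the last line, is the kind of tuning an editor would do after reviewing false positives and misses.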

So which method is right for you? 

Rules-based classification is remarkably accurate for scientific, medical and pharmacological taxonomies that have very distinct matching terms.  Topic titles like "Abscisic Acid" or "Hydroxypalmitate" which are contained in content are very likely to be a good topical match.  The downsides to using rules-based classification are the time investment necessary to create the matching rules and the false negatives, or missed assignments, due to very rigid matching terms.

While imperfect, statistical classification is a better fit for general knowledge, historical, current events and financial taxonomies.  Topic titles like "Aaron" could refer to the biblical Aaron, Hank Aaron, Aaron Spelling or even Aaron's Bicycle Repair!   Statistical classification uses the context provided from the original training set to tailor results very specifically to your content set.  This technology is likely to produce false positives, but if tuned properly can be editor-assisted and effective.

Though research in this area has been ongoing for decades, the explosion of digital content and the growing power of CPUs have led to a renaissance of sorts in this technology.  We can only expect the technology to get better with greater investment and wider use.

XML as a directed graph

In the past few years I've heard the same revelation from software engineers at many different customers and technology firms: "Content isn't really hierarchical, it's a directed graph!" Have to say, that's something I rarely heard back in the good old SGML days.

If you aren't from a computer science or math background, you might think getting all excited about directed graphs is just so much mathematical geekery. It's not. One reason XML has triumphed where SGML didn't is the application of just such knowledge. 

The directed graph revelation means that sometimes content is best modeled neither as trees nor as relational tables, but as something with the attributes of both trees and tables - like a network or web where the relationships among objects can go in many directions. Particular representations of content might be simple trees or related tables, but, in the abstract, that isn't good enough. There are real-world reasons to care about this when developing software. It's also one reason that the question we've all grappled with when modeling content - "What's data and what's a document?" - is often a red herring.
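For the programmers in the audience, here is a small Python sketch of what that looks like in practice. The ContentGraph class, the node names and the relation labels are all hypothetical; the point is only that a reused chapter has more than one incoming relationship, which a strict tree cannot express.

```python
class ContentGraph:
    """Content objects as nodes, labeled relationships as directed edges."""

    def __init__(self):
        self.edges = {}  # source -> list of (relation, target)

    def add_edge(self, source, relation, target):
        self.edges.setdefault(source, []).append((relation, target))
        self.edges.setdefault(target, [])  # make sure the target is a node

    def incoming(self, node):
        """All (source, relation) pairs that point at the given node."""
        return [
            (source, relation)
            for source, outgoing in self.edges.items()
            for relation, target in outgoing
            if target == node
        ]

    def reachable(self, start):
        """Every node that can be reached by following edges from start."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(target for _, target in self.edges.get(node, []))
        return seen


graph = ContentGraph()
graph.add_edge("book", "contains", "chapter1")
graph.add_edge("book", "contains", "chapter2")
graph.add_edge("anthology", "contains", "chapter1")  # same chapter, reused
graph.add_edge("chapter2", "cites", "chapter1")      # cross-reference

chapter1_sources = graph.incoming("chapter1")
```

In a tree, chapter1 could have only one parent. Here it has three incoming relationships, two of containment and one citation, and each can be traversed or queried on its own terms.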

There are plenty of useful sources on this topic on the web if you're interested in more detail. Wikipedia has a user-friendly description of directed graphs. Web services implementations often involve modeling the relationships among business objects as graphs.

And a Google search for "directed graph XML" turns up a 1996 discussion in which James Clark (a computer scientist who is personally responsible for much of the early adoption of SGML and XML through the development of free, really useful software - oh, and also partly responsible for XML itself) reveals (unsurprisingly) that he got this concept long before the computer scientists currently telling me about their revelations.

Microsoft and OCA, content formats for digital libraries

If you haven't heard, Microsoft has joined with the Open Content Alliance (OCA). The OCA is the creation of Yahoo and the nonprofit Internet Archive, and Microsoft's participation is drawing more attention to how the OCA's efforts to digitize content contrast with Google's. This week CNET published an article that nicely articulates the differences in approach. The OCA has taken a less contentious route than Google by limiting their effort to those works that are in the public domain, unless the copyright owner has given permission that they be included. (While you're reading the CNET story, check out the cool "Big Picture" feature on their site. It provides a graphical means to navigate CNET content by several different subject areas. Think "Semantic Web".)

I haven't found much information about the digital content format to be used by Google or OCA. Adobe is part of OCA, and at least some OCA content will be stored as PDF (with fulltext in the background for search?). OCA will also be gathering multimedia content as well as print sources. If you go to the Google Print site, you'll see that book pages appear to be captured as images. It would be especially interesting to know what kind of metadata is (and isn't) being captured by both groups. For example, it seems obvious that the Google system doesn't "know" when two books are actually the same classic (say, Shakespeare's Macbeth) published by different publishers. (And I'm not suggesting it makes business sense for Google to capture this information.) In fact, if you search for "Shakespeare Macbeth" on the site, you get more than 32,000 results, and the versions of the play Macbeth don't all float to the top - a book on Kurosawa is in position 5 (today). While to a particular reader this might not matter too much, it certainly will be of interest to the publisher. (Simon and Schuster's Macbeth is in position #1, but Kessinger's doesn't show up until the 3rd page of results.) Can the publisher influence this in any way? Should they be able to?

About this Blog

This blog is produced by the consultants and analysts from Really Strategies, a content solutions and services provider.
