Thinking about a spring clean?

A client, embarking on the rollout of a new Content Management System in 2007, asked me for an opinion on the potentially thorny problem of classifying a large set of existing content.

The answer is, “well, there are many ways to attack it.”  So I thought I’d share my points of view on it:

Manual classification

  • asking authors to classify their own content gives you a high degree of accuracy, but the result is often a subjective set of tags (the author knows what they were thinking when they wrote the document, but might not consider wider tags which are equally applicable to the content).
  • employing Information Scientists (directly or by outsourcing to a group like TFPL’s Information Service) to read, appraise and tag the documents. This could be a useful approach if you want other metadata to be created, for example a summary, abstract or headline, where there is some skill in creating those new meta items.
  • employing a team of classifiers, trained on a specific taxonomy, to apply it to the content. If the volumes are not huge and this is a one-off task, bringing in some temporary contractors to plough through the documents might be low tech, but could be the best option.

For all of the above options the size of the taxonomy can be a limiting factor.  In any structure of more than 200 nodes, the lower-level nodes are unlikely to be used.

Automatic classification

For large content sets, or for multiple and/or large taxonomy structures, an automatic classification system might be the best approach. However, the trade-off is typically the effort required “up front” to develop the tagging rules.

There are a number of software applications which, in one form or another, build up rule sets to apply to the language of a document and return one or more tags (dependent on the setting of a threshold value).  This process can include:

  • Training sets: compare a positive set (documents you know are about the subject) to a negative set (a random control set) and generate the linguistic rules.  The effort involved in creating the initial document sets is non-trivial.
  • Manual rule bases: these require expertise in the rule language and the application.
  • Machine learning systems: these need monitoring and tailoring over time to improve accuracy.
  • Thesaurus-driven systems: these use keyword relationships to preferred terms, with an algorithm to create a rule base. They can be set up reasonably easily, but will need tailoring for complex and ambiguous language (which English seems to be littered with when you get into this topic!)
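
To make the thesaurus and threshold idea a little more concrete, here is a minimal Python sketch. It is only an illustration: the tags, the keywords, the scoring formula (keyword hits as a share of all words) and the threshold value are all invented for the example, and real products will build far richer rules.

```python
import re
from collections import Counter

# Hypothetical thesaurus: each preferred term (tag) maps to keywords that suggest it.
THESAURUS = {
    "Waste management": ["recycling", "landfill", "refuse", "waste"],
    "Housing": ["tenancy", "landlord", "housing", "rent"],
}


def classify(text, threshold=0.05):
    """Return (tag, score) pairs whose keyword share of the text meets the threshold."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    tags = []
    for tag, keywords in THESAURUS.items():
        hits = sum(counts[k] for k in keywords)
        score = hits / len(words)  # assumed scoring: share of words that are keywords
        if score >= threshold:
            tags.append((tag, round(score, 3)))
    # Highest-scoring tags first
    return sorted(tags, key=lambda pair: pair[1], reverse=True)


print(classify("The council collects refuse weekly and recycling fortnightly"))
```

Raising the threshold returns fewer, more confident tags; lowering it returns more. Note that a keyword like "refuse" would also match in "we refuse to pay", which is exactly the sort of ambiguity that needs tailoring.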

However, for some of the known taxonomies, like the UK government-sponsored Integrated Public Sector Vocabulary (IPSV), there are off-the-shelf classification systems that can be employed directly.

The type and format of the content is also a major factor:

  • If your reports follow a standard template then you have a much improved chance of getting a successful outcome with rule-based classification (knowing the first section is always an Executive Summary means that more weighting can be applied to the content there than to the content in the Appendices, for example).
  • Large complex documents can have many subjects and while a classifier could apply many tags it might be better to create distinct sections that have a more focussed set of tags, depending on the desired application of the content.
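
The section weighting mentioned above can be sketched along the following lines in Python. This assumes the document has already been split into named sections; the section names, the weights and the simple keyword count are assumptions made up for the illustration, not settings from any real classifier.

```python
import re

# Hypothetical weights: text in the Executive Summary counts more than the Appendices.
SECTION_WEIGHTS = {"Executive Summary": 3.0, "Body": 1.0, "Appendices": 0.25}


def weighted_score(sections, keywords):
    """Sum keyword hits per section, scaled by that section's weight."""
    wanted = {k.lower() for k in keywords}
    score = 0.0
    for name, text in sections.items():
        words = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for w in words if w in wanted)
        # Sections without a configured weight fall back to 1.0
        score += SECTION_WEIGHTS.get(name, 1.0) * hits
    return score


report = {
    "Executive Summary": "Recycling rates rose",
    "Appendices": "Recycling recycling tables",
}
print(weighted_score(report, ["recycling"]))
```

Here a single mention in the Executive Summary outweighs two mentions in the Appendices. The same per-section scores could also be used to tag each section separately, rather than the document as a whole.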

It's certainly a challenge, but there are options out there, and the improvement in search and retrieval that comes with a well-classified content set is a worthwhile benefit.
