Thinking about a spring clean?
A client, embarking on the roll out of a new Content Management System in 2007, asked me for an opinion on the potentially thorny problem of classifying a large set of existing content.
The answer is, “well, there are many ways to attack it.” So I thought I’d
share my point of view on it:
Manual classification
- asking authors to classify
their own content gives you a high degree of accuracy, but the result is often a subjective
set of tags (the author knows what they were thinking when they wrote the
document, but might not consider wider tags which are equally applicable to the
content).
- employing Information Scientists
(directly or by outsourcing to a group like TFPL’s Information Service) to read, appraise and tag
the document. This could be a useful approach if you want other metadata to be
created, for example a summary, abstract or
headline, where there is some skill in creating those new metadata items.
- employing a team of classifiers,
trained on a specific taxonomy, to apply it to the content. If the volumes are
not huge and this is a one-off task, hiring some temporary contractors to
plough through the documents might be low tech, but could be the best
option.
For large content sets, or for
multiple and/or large taxonomy structures, an automatic classification system
might be the best approach. However, the trade-off is typically the effort
required “up front” to develop the tagging rules.
There are a number of software
applications which, in one form or another, build up rule sets to apply to the
language of a document and return one or more tags (depending on the setting of
a threshold value). This process can include:
- Training sets: compare a positive
set (documents you know are about the subject) to a negative set (a random control
set) and generate the linguistic rules. The effort involved in creating the
initial document sets is non-trivial.
- Manual rule bases – require
expertise in the rule language and application
- Machine learning systems – need
monitoring and tailoring over time to improve
accuracy.
- Thesaurus-driven systems – use
keyword relationships to preferred terms, with an algorithm to create a rule
base. Can be set up reasonably easily, but will need tailoring for complex and ambiguous language (which English seems to be littered with when you get into this topic!)
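To make the thesaurus and threshold idea a little more concrete, here is a minimal Python sketch. It is not how any particular product works: the `THESAURUS` mapping, the `classify` function and the scoring (keyword hits as a share of all words in the document) are assumptions made up purely for illustration.

```python
import re
from collections import Counter

# Hypothetical thesaurus: each keyword maps to the preferred term (tag).
THESAURUS = {
    "invoice": "Finance",
    "budget": "Finance",
    "audit": "Finance",
    "recruitment": "Human resources",
    "vacancy": "Human resources",
    "training": "Human resources",
}


def classify(text, thesaurus=THESAURUS, threshold=0.05):
    """Return the tags whose keyword hits make up at least `threshold`
    of all the words in the text, sorted alphabetically."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    # Count how many words point at each preferred term.
    hits = Counter(thesaurus[w] for w in words if w in thesaurus)
    scores = {tag: count / len(words) for tag, count in hits.items()}
    return sorted(tag for tag, score in scores.items() if score >= threshold)


text = "The budget audit flagged one invoice before the recruitment round."
print(classify(text, threshold=0.2))   # ['Finance']
print(classify(text, threshold=0.05))  # ['Finance', 'Human resources']
```

Raising the threshold returns fewer, more confident tags. The ambiguity problem shows up straight away too: a keyword like "training" would tag a document about classifier training sets as "Human resources", which is exactly the kind of tailoring these systems need.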
However, for some of the well-known
taxonomies, like the UK government-sponsored Integrated Public Sector Vocabulary (IPSV), there are off-the-shelf classification systems that can be
employed directly.
The type and format of the content
is also a major factor:
- If your reports follow a standard
template then you have a much better chance of getting a successful outcome with rule-based classification
(knowing the first section is always an Executive Summary means that more
weighting can be applied to the content there than to the content in the
Appendices, for example).
- Large, complex documents can cover many subjects, and while a classifier could apply many tags, it might be better to split them into distinct sections that each have a more focussed set of tags, depending on the desired application of the content.
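The section weighting mentioned above could be sketched along these lines in Python. The headings, the weight values and the `weighted_keyword_score` function are all hypothetical, chosen only to show the idea of counting a hit in the Executive Summary for more than a hit in the Appendices.

```python
import re

# Hypothetical weights per section heading; anything not listed gets 1.0.
SECTION_WEIGHTS = {"Executive Summary": 3.0, "Appendices": 0.5}
DEFAULT_WEIGHT = 1.0


def weighted_keyword_score(sections, keywords, weights=SECTION_WEIGHTS):
    """Sum keyword hits across sections, scaling each section's hits
    by the weight of its heading.

    `sections` maps a section heading to that section's text.
    """
    keywords = {k.lower() for k in keywords}
    score = 0.0
    for heading, text in sections.items():
        words = re.findall(r"[a-z']+", text.lower())
        hits = sum(1 for w in words if w in keywords)
        score += weights.get(heading, DEFAULT_WEIGHT) * hits
    return score


report = {
    "Executive Summary": "Budget cuts affect the audit.",
    "Findings": "The budget is tight.",
    "Appendices": "Budget tables. Audit logs.",
}
# 2 hits x 3.0 + 1 hit x 1.0 + 2 hits x 0.5
print(weighted_keyword_score(report, ["budget", "audit"]))  # 8.0
```

The same dictionary of sections could also be tagged one section at a time, which is one way to get the more focussed, per-section tags for large documents.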
It's certainly a challenge, but there are options out there, and the improvement in search and retrieval that comes with a well-classified content set is a worthwhile benefit.
