The Open Information Systems Journal

2009, 3 : 54-67
Published online 2009 July 23. DOI: 10.2174/1874133900903010054
Publisher ID: TOISJ-3-54

Term-Centric Active Learning for Naïve Bayes Document Classification

Sunghwan Sohn, Donald C. Comeau and W. John Wilbur
Biomedical Statistics and Informatics, Health Sciences Research, Mayo Clinic, Rochester, MN.

ABSTRACT

In real-world document classification, a subset of documents often must be chosen for labeling as a training set for a machine learner. Random sampling is generally not the most effective way to choose these documents. Active learning instead selects useful examples for labeling in order to improve the efficiency of learning. We consider two factors in measuring the usefulness of a document for labeling: such a document should be 1) largely unknown to the current learner and 2) influential, in the sense of being close to many other documents. These factors are stated from a document-centric viewpoint, but a similar analysis can be made from a term-centric viewpoint. The purpose of this paper is to present this term-centric approach to active learning using a naïve Bayes classifier. We study both document-centric methods and our new term-centric methods, and we find that the term-centric methods perform well on numerous data sets with different characteristics. In addition, we employ a genetic algorithm to estimate the optimal performance at a fixed training set size; our results are between 84% and 99% of this estimated optimum.

Keywords:

Active learning, genetic algorithm, naïve Bayes classifier, pool adjacent violators algorithm, uncertainty sampling.
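
The two document-centric factors named in the abstract (a document should be largely unknown to the learner and close to many other documents) can be illustrated with a minimal sketch. This is not the scoring formula used in the paper. It assumes that uncertainty is measured as 1 - |2p - 1| for a posterior p, that density is the mean cosine similarity to the other pool documents, and that the two are combined by multiplication. The function names and the toy posteriors, which stand in for the output of a naïve Bayes classifier, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def usefulness(pool, posteriors):
    """Score each unlabeled document as uncertainty * density.

    Both factor definitions and their product are illustrative
    assumptions, not the paper's definitions.
    """
    scores = []
    for i, doc in enumerate(pool):
        # Factor 1: uncertainty is highest when P(positive | doc) = 0.5.
        uncertainty = 1.0 - abs(2.0 * posteriors[i] - 1.0)
        # Factor 2: density is the mean similarity to the other documents.
        others = [cosine(doc, d) for j, d in enumerate(pool) if j != i]
        density = sum(others) / len(others) if others else 0.0
        scores.append(uncertainty * density)
    return scores


def select_for_labeling(pool, posteriors):
    """Return the index of the highest-scoring unlabeled document."""
    scores = usefulness(pool, posteriors)
    return max(range(len(pool)), key=scores.__getitem__)


# Toy pool of term-count vectors with hypothetical classifier posteriors.
pool = [
    {"gene": 2, "protein": 1},
    {"gene": 1, "protein": 2},
    {"stock": 3, "market": 1},
]
posteriors = [0.55, 0.95, 0.50]

scores = usefulness(pool, posteriors)
chosen = select_for_labeling(pool, posteriors)
```

In this toy pool the third document is the most uncertain but shares no terms with the others, so its density is zero. The first document is selected because it is both uncertain and similar to another document. A term-centric variant would score terms rather than documents; its details are left to the body of the paper.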