Speaker: Fei Song (University of Guelph)

Text categorization is the process of assigning predefined categories to textual documents. With the exponential growth of web pages and online documents, there is an increasing need for systems that automatically classify text into the proper categories. Text categorization is also the foundation of any web content filtering system: only once we know that certain web pages belong to offensive or inappropriate categories can we block them from user access.

Many statistical machine learning methods have been successfully applied to text categorization, including Naive Bayes, k-Nearest Neighbors, Multi-layer Neural Networks, Linear Least Square Fit, and Support Vector Machines. In this talk, we focus on the use of Maximum Entropy Modeling for text categorization. Maximum entropy provides a principled way of estimating probability distributions from training data. The key principle is that when nothing is known about certain features, the distribution over them should be as uniform as possible (hence "maximum entropy"). Maximum Entropy Modeling also has the desirable property of accommodating dependency relationships among features, by setting the weighting parameters appropriately.
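Since a maximum entropy model with the standard exponential form is equivalent to multinomial logistic regression, the idea can be sketched in a few lines of NumPy. This is a toy illustration under that assumption, not the system evaluated in the talk; note that the zero-weight starting point yields exactly the uniform distribution the maximum entropy principle calls for:

```python
import numpy as np

def train_maxent(X, y, n_classes, lr=0.5, epochs=500):
    """Fit a maximum entropy (multinomial logistic) model by gradient
    ascent on the conditional log-likelihood.
    X: (docs, features) term counts; y: integer class labels."""
    n, d = X.shape
    W = np.zeros((n_classes, d))   # zero weights => uniform (max-entropy) start
    Y = np.eye(n_classes)[y]       # one-hot targets
    for _ in range(epochs):
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        W += lr * (Y - P).T @ X / n                   # log-likelihood gradient
    return W

def predict(W, X):
    return np.argmax(X @ W.T, axis=1)

# Toy corpus: columns are counts of the terms ["ball", "vote"];
# class 0 = sports, class 1 = politics (purely illustrative data).
X = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
W = train_maxent(X, y, n_classes=2)
print(predict(W, X))   # should recover the training labels
```

The gradient step `(Y - P).T @ X` is the difference between observed and expected feature counts, which is zero exactly when the model's constraints are satisfied.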

When representing documents by features, not all keywords are equally useful: some common words blur the distinctions among documents, while some rare words affect the classification of only a small number of documents. Experiments have shown that feature reduction not only reduces the computational cost but can also improve classification performance. We compare seven feature selection methods (Document Frequency, χ² Ranking, Count Difference, Likelihood Ratio, Optimal Orthogonal Centroid, Term Discrimination, and Information Gain) and one feature extraction method based on Probabilistic Latent Semantic Analysis.
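To make the ranking criteria concrete, here is a minimal sketch of two of the methods above, Document Frequency and χ² Ranking, over a toy bag-of-words matrix (illustrative only, not the evaluated implementation; the χ² version here is the common binary one-vs-rest form):

```python
import numpy as np

def document_frequency(X):
    """Document Frequency: the number of documents containing each term."""
    return (X > 0).sum(axis=0)

def chi2_scores(X, y):
    """Chi-square score of each term's presence/absence against a binary
    class label y -- a standard term-ranking criterion."""
    occurs = (X > 0).astype(float)
    n = float(len(y))
    pos = (y == 1).astype(float)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        A = np.sum(occurs[:, j] * pos)              # term present, class positive
        B = np.sum(occurs[:, j] * (1 - pos))        # term present, class negative
        C = np.sum((1 - occurs[:, j]) * pos)        # term absent, class positive
        D = np.sum((1 - occurs[:, j]) * (1 - pos))  # term absent, class negative
        den = (A + C) * (B + D) * (A + B) * (C + D)
        scores[j] = n * (A * D - C * B) ** 2 / den if den else 0.0
    return scores

# Toy term-count matrix: rows = documents, columns = terms.
X = np.array([[2, 0, 1], [1, 0, 1], [0, 3, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0])
print(document_frequency(X))   # term 2 appears in every document
print(chi2_scores(X, y))       # term 2 scores 0: it cannot discriminate
k = 2
top_k = np.argsort(chi2_scores(X, y))[::-1][:k]  # indices of the k best terms
```

Feature reduction then keeps only the `top_k` columns of the term matrix, which is how the feature count is cut down before training the classifier.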

To demonstrate the effectiveness of Maximum Entropy Modeling for text categorization, we conduct experiments on the RCV1 (Reuters Corpus Volume I) data set. Our results show that cutting the number of features to the 1,000-2,000 range still yields good classification performance. In particular, Document Frequency, Count Difference, and Optimal Orthogonal Centroid outperform the other feature reduction methods. Our results also show that Maximum Entropy Modeling is a competitive method for text categorization: its performance is better than that of k-Nearest Neighbors and Naive Bayes, although still not as good as that of Support Vector Machines.