Maximum Entropy Modeling with Feature Reduction for Text Categorization
Speaker: Fei Song (University of Guelph)
Text categorization is the process of assigning predefined categories
to textual documents. As the exponential growth of web pages and
online documents continues, there is an increasing need for systems
that automatically classify text into proper categories. Text
categorization is the foundation for any web content filtering system
since only when we know that the categories of certain web pages are
offensive or inappropriate can we further block them from user access.
Many statistical machine learning methods have been successfully
applied to text categorization, including Naive Bayes, k-Nearest
Neighbors, Multi-layer Neural Networks, Linear Least Squares Fit, and
Support Vector Machines. In this talk, we focus on the use of Maximum
Entropy Modeling for text categorization. Maximum entropy provides a
reasonable way of estimating probability distributions from the
training data. The key principle is that when nothing is known about
certain features, the distribution over them should be as uniform as
possible, that is, it should have maximum entropy. Maximum Entropy
Modeling has the desirable property that it accommodates dependency
relationships among features through appropriately set weighting
parameters.
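
As a rough illustration of the idea, the sketch below fits a conditional
maximum entropy classifier, which has the same form as multinomial
logistic regression: p(c | d) is proportional to exp(sum_i w[i, c] * x[d, i]).
It is only a minimal sketch under stated assumptions, not the
implementation used in the talk: the function names (predict_proba,
train_maxent), the learning rate, and the use of plain gradient ascent
instead of an iterative scaling procedure are all choices made here for
brevity.

```python
import numpy as np

def predict_proba(X, W):
    """Return p(class | document) for each row of X under weights W."""
    scores = X @ W
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=1, keepdims=True)
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

def train_maxent(X, y, n_classes, lr=0.5, epochs=500):
    """Fit weights by gradient ascent on the conditional log-likelihood.

    X: (n_docs, n_features) feature matrix, y: integer class labels.
    The optimizer and hyperparameters are illustrative assumptions.
    """
    n_docs, n_feats = X.shape
    W = np.zeros((n_feats, n_classes))   # all-zero weights give a uniform model
    Y = np.eye(n_classes)[y]             # one-hot encoding of the labels
    for _ in range(epochs):
        P = predict_proba(X, W)
        # Gradient = observed feature counts minus expected feature counts.
        grad = X.T @ (Y - P) / n_docs
        W += lr * grad
    return W
```

With all weights at zero the model assigns equal probability to every
category, and a feature that never occurs in the training data keeps a
zero weight, so it leaves the predicted distribution untouched. This
mirrors the "as uniform as possible" principle described above.
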
When representing documents by features, not all keywords are
equally useful: some commonly used words may reduce the distinctions
among the documents, and some rare words may only affect the
classification of a small number of documents. Experiments have shown
that feature reduction can not only reduce the computational cost but
also help improve the classification performance. We compare seven
different feature selection methods, namely Document Frequency, χ²
Ranking, Count Difference, Likelihood Ratio, Optimal Orthogonal
Centroid, Term Discrimination, and Information Gain, and one feature
extraction method based on Probabilistic Latent Semantic Analysis.
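
To make two of the selection criteria above concrete, the following
sketch scores terms by Document Frequency and by a chi-square statistic
and then keeps the k highest-scoring terms. It is a hedged illustration
rather than the procedure used in the experiments: the function names
are hypothetical, terms are treated as present or absent per document,
and taking the maximum chi-square value over categories is one common
way to combine per-category scores that is assumed here.

```python
import numpy as np

def document_frequency(X):
    """Number of documents in which each term occurs at least once."""
    return (X > 0).sum(axis=0)

def chi_square(X, y, n_classes):
    """Per-term chi-square score, taken as the maximum over categories.

    Uses the 2x2 contingency table (term present/absent vs. in/out of
    the category). The max-over-categories aggregation is an assumption.
    """
    present = (X > 0).astype(float)
    n = present.shape[0]
    term_total = present.sum(axis=0)
    scores = np.zeros(present.shape[1])
    for c in range(n_classes):
        in_c = (y == c).astype(float)
        a = present.T @ in_c        # term present, document in category c
        b = term_total - a          # term present, document not in c
        c_abs = in_c.sum() - a      # term absent, document in c
        d = n - a - b - c_abs       # term absent, document not in c
        num = n * (a * d - b * c_abs) ** 2
        den = (a + b) * (c_abs + d) * (a + c_abs) * (b + d)
        # Terms with a degenerate table (zero denominator) score 0.
        chi = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        scores = np.maximum(scores, chi)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring terms, best first."""
    return np.argsort(-scores, kind="stable")[:k]
```

A term that appears in every document gets the highest Document
Frequency but a chi-square score of zero, since its presence carries no
information about the category; this is one way the criteria can rank
the same vocabulary differently.
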
To demonstrate the effectiveness of Maximum Entropy Modeling for
text categorization, we conduct experiments on the RCV1 (Reuters
Corpus Volume I) data set. Our results show that good classification
performance can be obtained when the number of features is cut to the
1000-2000 range. In particular, Document Frequency, Count Difference, and
Optimal Orthogonal Centroid have been shown to outperform the other
feature reduction methods. Our results also show that Maximum Entropy
Modeling is a competitive method for text categorization: its
performance is better than that of k-Nearest Neighbors and Naive
Bayes, although still not as good as that of Support Vector Machines.