Naive Bayes modelling with proper smoothing for information extraction

Speaker: Zhenmei Gu

Information Extraction (IE) summarizes a collection of textual documents into a structured representation by identifying specific facts in the text. The naive Bayes model was one of the first statistical models applied to IE for learning extraction patterns from labelled data. Despite the simplicity and popularity of the naive Bayes model, we have observed a formulation problem in previous work on naive Bayes IE. In this talk, we present a formal naive Bayes model for IE, from which the derived formula for estimating filler probabilities is more theoretically sound. We also address smoothing techniques for naive Bayes IE in order to overcome the data sparseness problem. Our proposed smoothing method, based on unseen species estimation and proper allocation of the total unseen probability among all unseen events, is shown to be critical to the robustness of a naive Bayes IE system. Several other issues in designing a naive Bayes IE system will also be discussed. Experimental results show that our naive Bayes IE systems achieve better extraction performance than previous work on naive Bayes IE.
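
As a rough illustration of the smoothing idea mentioned in the abstract, the sketch below reserves a total probability mass for unseen events and divides it among an estimated number of unseen event types. The abstract does not say which estimators the talk uses, so this sketch assumes two standard stand-ins: the Good-Turing estimate of the total unseen mass (singletons divided by sample size) and the Chao1 estimate of the number of unseen species. The function name `make_smoothed_estimator` and the uniform split among unseen events are hypothetical choices made for illustration only, not the speaker's method.

```python
from collections import Counter


def make_smoothed_estimator(observations):
    """Return a function prob(event) with mass reserved for unseen events.

    Assumptions (not taken from the talk):
      - total unseen mass = n1 / n        (Good-Turing estimate)
      - unseen species    = n1^2 / (2*n2) (Chao1 estimate)
      - the unseen mass is split uniformly among unseen species
    """
    counts = Counter(observations)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one observation")

    n1 = sum(1 for c in counts.values() if c == 1)  # events seen once
    n2 = sum(1 for c in counts.values() if c == 2)  # events seen twice

    # Good-Turing: probability that the next event is a new one.
    # Note: if every event is a singleton this equals 1, a known
    # degenerate case that a real system would have to handle.
    unseen_mass = n1 / n

    # Chao1: estimated number of event types not yet observed.
    if n2 > 0:
        unseen_species = n1 * n1 / (2 * n2)
    else:
        unseen_species = n1 * (n1 - 1) / 2  # bias-corrected form
    unseen_species = max(unseen_species, 1.0)  # avoid division by zero

    def prob(event):
        c = counts.get(event, 0)
        if c > 0:
            # Seen events share the remaining mass in proportion to count.
            return (1.0 - unseen_mass) * c / n
        # Each unseen event gets an equal share of the reserved mass.
        return unseen_mass / unseen_species

    return prob


if __name__ == "__main__":
    fillers = ["a", "a", "a", "b", "b", "c", "d", "e"]
    prob = make_smoothed_estimator(fillers)
    print(prob("a"), prob("c"), prob("never-seen"))
```

In a naive Bayes IE setting, `observations` would be something like the words observed in a given slot's fillers or contexts; the point of the sketch is only that no event receives zero probability, so a single unseen word cannot zero out a whole product of probabilities.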