Segment-based Hidden Markov Models for Information Extraction

Speaker: Zhenmei Gu

Hidden Markov models (HMMs) are powerful statistical models that have been applied successfully to Information Extraction (IE). Current approaches apply HMMs at the document level, i.e., a single HMM models the entire document. This can cause undesired redundancy in extraction: more than one filler is identified and extracted. In a typical template-filling IE task, which expects one filler per document, such redundant extractions make it difficult for HMM IE systems to generate a structured representation of texts fully automatically.

To address this redundancy problem, we propose using HMMs to model text at the segment level, where a segment is a contiguous part of a document. In our segment HMM IE approach, extraction consists of two steps. The first is a segment retrieval step, in which segments relevant to the extraction are identified and retrieved. The second is an extraction step, in which an HMM extractor identifies and extracts fillers from the retrieved segments. We have proposed several methods for retrieving extraction-relevant segments from unseen documents. In this talk, we describe one such method, which uses HMMs to model and retrieve segments. Experimental results show that the resulting segment HMM IE system not only achieves near-zero extraction redundancy but also has better overall extraction performance than traditional document HMM IE systems.
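The two-step process can be illustrated with a minimal sketch. It is not the system described in the talk: the line-based segmentation, the unigram likelihood-ratio scorer used for retrieval (a simple stand-in for the HMM-based retrieval method), the two-state extractor, and all names and probabilities below are assumptions chosen for illustration, and the document is toy data.

```python
import math
from typing import Dict, List

FLOOR = 1e-6  # assumed probability for tokens a model has never seen


def split_segments(document: str) -> List[List[str]]:
    # Assumption: each non-empty line is one segment.
    return [line.split() for line in document.splitlines() if line.strip()]


def relevance(tokens: List[str], relevant: Dict[str, float],
              background: Dict[str, float]) -> float:
    # Step 1 scorer (stand-in): log-likelihood ratio under two unigram models.
    return sum(
        math.log(relevant.get(t, FLOOR)) - math.log(background.get(t, FLOOR))
        for t in tokens
    )


def viterbi(tokens, states, start, trans, emit) -> List[str]:
    # Standard Viterbi decoding in log space; returns the best state path.
    scores = [{s: math.log(start[s]) + math.log(emit[s].get(tokens[0], FLOOR))
               for s in states}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans[p][s]))
            row[s] = (scores[-1][prev] + math.log(trans[prev][s])
                      + math.log(emit[s].get(tok, FLOOR)))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def extract(document, relevant, background, hmm) -> List[str]:
    # Step 1: retrieve the single most relevant segment.
    segments = split_segments(document)
    best = max(segments, key=lambda seg: relevance(seg, relevant, background))
    # Step 2: run the HMM extractor on that segment only.
    tags = viterbi(best, *hmm)
    return [tok for tok, tag in zip(best, tags) if tag == "filler"]


# Toy data (hypothetical): extract the speaker from a seminar announcement.
DOC = "Seminar on information extraction\nSpeaker: Jane Doe\nRoom 101 at noon"
RELEVANT = {"Speaker:": 0.5}
BACKGROUND = {"Seminar": 0.2, "on": 0.2, "information": 0.2,
              "extraction": 0.2, "Room": 0.1, "at": 0.1}
HMM = (
    ["bg", "filler"],
    {"bg": 0.9, "filler": 0.1},
    {"bg": {"bg": 0.6, "filler": 0.4}, "filler": {"bg": 0.2, "filler": 0.8}},
    {"bg": {"Speaker:": 0.5}, "filler": {"Jane": 0.4, "Doe": 0.4}},
)

print(extract(DOC, RELEVANT, BACKGROUND, HMM))
```

Because the extractor only sees the one retrieved segment, a second mention of a candidate filler elsewhere in the document cannot produce a redundant extraction, which is the motivation given above for modeling at the segment level.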