Segment-based Hidden Markov Models for Information Extraction
Speaker: Zhenmei Gu
Hidden Markov models (HMMs) are powerful statistical models that have
found successful applications in Information Extraction (IE). In current
approaches to applying HMMs to IE, an HMM is used to model text at the
document level, i.e., the entire document is modeled by an HMM. This
modeling can cause undesired redundancy in extraction, in the sense that
more than one filler is identified and extracted. In a typical
template-filling IE task, which expects one filler per document, such
redundant extractions make it difficult for HMM IE systems to generate a
structured representation of texts in a fully automatic way.
To address this problem of redundant extractions, we propose using HMMs
to model text at the segment level, where a segment is a contiguous part
of a document. In our segment HMM IE
approach, the extraction process consists of two steps. The first step is
a segment retrieval step in which segments relevant to the extraction are
identified and retrieved. The second step is an extraction step in which
an HMM extractor is used to identify and extract fillers from the segments
retrieved in the first step. We have proposed several methods for retrieving
extraction-relevant segments from unseen documents. In
this talk, we will describe one such method which uses HMMs to model and
retrieve segments. Experimental results show that the resulting segment
HMM IE system not only achieves near-zero extraction redundancy, but also
has better overall extraction performance than traditional document HMM IE
systems.
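
The two-step process described above can be illustrated with a minimal
Python sketch. Everything in it is an assumption made for illustration and
is not taken from the talk: the two-state toy HMM with hand-set
probabilities, the token lists, and the helper names retrieve_segment and
extract_filler are all hypothetical. In particular, the retrieval step here
is a simple cue-word count standing in for the HMM-based segment retrieval
that the talk describes; only the extraction step uses an HMM (decoded with
the standard Viterbi algorithm).

```python
import math

# Hypothetical two-state HMM: "bg" (background text) and "tgt" (filler).
# All probabilities below are hand-set for illustration, not learned.
STATES = ("bg", "tgt")
START = {"bg": 0.9, "tgt": 0.1}
TRANS = {"bg": {"bg": 0.8, "tgt": 0.2}, "tgt": {"bg": 0.4, "tgt": 0.6}}
EMIT = {
    "bg": {"speaker": 0.3, ":": 0.3, "room": 0.1},
    "tgt": {"zhenmei": 0.4, "gu": 0.4},
}
UNK = {"bg": 0.05, "tgt": 0.01}  # fallback probability for unseen tokens


def viterbi(tokens):
    """Return the most likely state sequence for tokens (log-space)."""
    def emit(state, tok):
        return math.log(EMIT[state].get(tok, UNK[state]))

    scores = [{s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this token.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + emit(s, tok)
            ptr[s] = best
        scores.append(row)
        back.append(ptr)

    # Backtrace from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def retrieve_segment(segments, cue_words):
    """Step 1 (stand-in): pick the segment containing the most cue words."""
    return max(segments, key=lambda seg: sum(tok in cue_words for tok in seg))


def extract_filler(segment):
    """Step 2: keep the tokens that the HMM labels as target."""
    path = viterbi(segment)
    return [tok for tok, state in zip(segment, path) if state == "tgt"]


# A toy "document" already split into contiguous, tokenized segments.
segments = [
    ["room", "dc", "1304"],
    ["speaker", ":", "zhenmei", "gu"],
    ["contact", "zhenmei", "gu", "for", "details"],
]
best = retrieve_segment(segments, cue_words={"speaker"})
filler = extract_filler(best)
print(filler)
```

Because the extractor runs only on the single retrieved segment, the second
mention of the name in the last segment is never considered, which is the
intuition behind reducing redundant extractions.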