Paragraph Context Determination Through Rhetorical Figures - A Practical Approach Using Epanaphora

Speaker: Claus Strommer

In this study we examine the applications of rhetorical figures in natural language processing. We focus on rhetorical anaphora (epanaphora). Since epanaphora is simply the repetition of words at the beginning of sentences, phrases, or paragraphs, it can be parsed with minimal machine error.

For the purpose of this research we use the TREC '06 Blog corpus, chosen over a pre-parsed corpus like Penn Treebank because its larger size encompasses a greater variety of styles. Furthermore, because it is more recent and because of the sources used, it is more representative of modern, everyday prose.

Detecting and recording epanaphora only involves finding well-delimited repetitions. Classification, however, is more challenging. We created three main categories into which we aggregate the instances of epanaphora found: accidental, designed-intentional, and brute-intentional. We show how to identify and use the markers that let us classify instances of epanaphora into these categories.

Once the instances of epanaphora have been classified, we demonstrate how the classification criteria correlate with the context of the sentences. We show that distinct author goals influence the type of epanaphora used, and that different types of epanaphora may be used as indicators of writing style and author intention.

We conclude that epanaphora is a cost-effective means of classifying the context of paragraphs and that this method can be used to complement other natural language processing techniques. We furthermore infer that our classification methods can be applied to other rhetorical figures, and that they are potentially language-agnostic.
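
To illustrate the detection step mentioned in the abstract (finding well-delimited repetitions at the start of sentences), here is a minimal Python sketch. It is not the speaker's implementation: the function names (`split_sentences`, `opening`, `find_epanaphora`), the naive punctuation-based sentence splitter, and the parameters `n` (number of opening words compared) and `min_run` (minimum number of consecutive sentences) are all assumptions made for illustration. Classification into the three categories is not attempted, since the abstract does not specify the markers used.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def opening(sentence: str, n: int) -> Tuple[str, ...]:
    # Lower-cased first n words of a sentence, ignoring punctuation.
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return tuple(words[:n])


def find_epanaphora(text: str, n: int = 2, min_run: int = 2) -> List[Tuple[str, List[str]]]:
    # Return (repeated opening, sentences) for each run of at least min_run
    # consecutive sentences that share the same first n words.
    sentences = split_sentences(text)
    runs = []
    i = 0
    while i < len(sentences):
        key = opening(sentences[i], n)
        j = i + 1
        if len(key) == n:  # skip sentences shorter than n words
            while j < len(sentences) and opening(sentences[j], n) == key:
                j += 1
            if j - i >= min_run:
                runs.append((" ".join(key), sentences[i:j]))
        i = j
    return runs


if __name__ == "__main__":
    sample = (
        "We shall fight on the beaches. We shall fight on the landing grounds. "
        "We shall never surrender. The end."
    )
    for phrase, group in find_epanaphora(sample):
        print(phrase, len(group))
```

On a noisy corpus such as blog text, a sketch like this would likely need a more robust sentence splitter and some handling of very common openings (for example "I am" or "It is"), which is presumably where the accidental category becomes relevant.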