Enhancing Text Clustering using Concept-based Mining Model
Speaker: Shady Shehata
Most text data mining techniques are based on either word or phrase
analysis of the text. Statistical analysis of the frequency of a term
(word or phrase) captures the importance of the term
within a document. However, to achieve a more accurate analysis, the
underlying mining technique should indicate terms that capture the
semantics of the text, from which the importance of a term in a
sentence and in the document can be derived. A new concept-based
mining model is introduced that relies on the analysis of both the
sentence and the document, rather than the traditional analysis of
the document dataset only. The proposed mining model consists of a
concept-based analysis of terms and a concept-based similarity
measure. Each term that contributes to the sentence semantics is
analyzed with respect to its importance at the sentence level and the
document level. The model can efficiently find significant matching
terms, either words or phrases, of the documents according to the
semantics of the text. The similarity between documents relies on a
new concept-based similarity measure that is applied to the matching
terms between documents. Experiments using the proposed concept-based
term analysis and similarity measure in text clustering are conducted.
Experimental results demonstrate that the newly developed
concept-based mining model substantially enhances the clustering
quality of sets of documents.
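
As a rough illustration of the idea of scoring a term at both the
sentence level and the document level, and then comparing documents
on their matching terms, here is a minimal Python sketch. It is not
the speaker's model: the abstract gives no formulas, so the tokenizer,
the two scores, their sum, and the cosine-style similarity are all
assumptions chosen for simplicity, and the function names are
hypothetical. Terms are plain lowercase words here, with no semantic
analysis.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive split on ., ! and ? (assumption; not a real sentence segmenter).
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase alphabetic words only (assumption; no phrases, no semantics).
    return re.findall(r"[a-z]+", sentence.lower())


def term_weights(text):
    """Weight each term by a sentence-level plus a document-level score."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    doc_counts = Counter(t for s in sentences for t in s)
    total = sum(doc_counts.values())
    weights = {}
    for term, count in doc_counts.items():
        # Document level: relative frequency of the term in the document.
        doc_score = count / total
        # Sentence level: mean share of the term in the sentences containing it.
        shares = [s.count(term) / len(s) for s in sentences if term in s]
        sent_score = sum(shares) / len(shares)
        # Hypothetical combination: a plain sum of the two scores.
        weights[term] = doc_score + sent_score
    return weights


def similarity(text_a, text_b):
    """Cosine-style similarity; only matching terms contribute to the numerator."""
    wa, wb = term_weights(text_a), term_weights(text_b)
    matching = set(wa) & set(wb)
    dot = sum(wa[t] * wb[t] for t in matching)
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = "Text clustering groups documents. Clustering uses similarity."
    b = "Document clustering relies on a similarity measure."
    print(similarity(a, b))
```

The resulting pairwise similarities could be fed to any standard
clustering algorithm. Documents with no terms in common score 0 under
this sketch, and a document compared with itself scores 1.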