A mechanism for analyzing the similarity of short texts, based on the Levenshtein distance

Artur Niewiarowski, Marek Stanuszek

Abstract


This paper proposes a text mining mechanism based on the Levenshtein Distance Algorithm (LDA), which effectively detects similarity between words of different lengths. The algorithm is applied here to the similarity analysis of sentences and successfully detects similarities between single sentences. The mechanism is characterized by fast data analysis and simplicity of implementation.
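The abstract does not spell out the mechanism, so the following is only a minimal sketch of the general idea: the classic dynamic-programming Levenshtein distance, plus a length-normalized similarity score that can be applied to words or whole sentences. The function names `levenshtein` and `similarity`, and the choice to normalize by the longer string, are assumptions made for illustration and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical texts, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("The cat sat on the mat", "The cat sits on the mat"))
```

Dividing by the longer length keeps the score in the range 0 to 1 regardless of text length, which makes it possible to compare pairs of different sizes. A real system would likely also normalize case and whitespace before comparing.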

Keywords


Natural Language Processing (NLP); Natural Language Understanding (NLU); Data Mining; Text Mining; Levenshtein Distance Algorithm

Full Text:

PDF (Polish)


DOI: http://dx.doi.org/10.21936/si2013_v34.n1.9