The mechanism of identification and classification of content

Artur Niewiarowski, Marek Stanuszek


This paper presents a mechanism for identifying and classifying content, based on a term weighting method that uses inverse document frequency analysis together with the Levenshtein distance technique. The proposed mechanism is applied to the analysis of topics and descriptions of selected diploma theses, in order to automatically select supervisors and reviewers.
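As a rough illustration of how term weighting and edit distance can be combined, the following minimal sketch scores short documents against a query. It is not the authors' implementation: the function names (`levenshtein`, `tf_idf`, `fuzzy_score`), the `max_dist` threshold, the plain `tf * log(N / df)` weighting, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by relative term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights


def fuzzy_score(query_terms: list[str], doc_weights: dict[str, float], max_dist: int = 1) -> float:
    """Sum the weights of document terms lying within max_dist edits of any query term."""
    return sum(w for term, w in doc_weights.items()
               if any(levenshtein(term, q) <= max_dist for q in query_terms))


# Hypothetical data: each document stands for the keyword profile of one supervisor.
docs = [
    ["neural", "network", "image", "classification"],
    ["text", "mining", "keyword", "extraction"],
    ["database", "index", "query", "optimization"],
]
query = ["keywords", "extracton", "text"]  # thesis topic terms, one with a typo

weights = tf_idf(docs)
scores = [fuzzy_score(query, w) for w in weights]
best = max(range(len(docs)), key=scores.__getitem__)  # index of the best-matching profile
```

In this sketch the edit distance only decides whether two terms count as the same word, so inflected or misspelled forms still contribute their weight. A real system would likely normalize the threshold by word length and handle tokenization and stop words, which are omitted here.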


keyword extraction; inverse document frequency; term frequency; Levenshtein distance; text mining; database mining


