Representations of text documents in the context of SPAM detection in Polish with English phrases

Piotr Andruszkiewicz


A text document representation should be as compact as possible while still yielding high classification accuracy. This paper presents representations of text documents and methods for reducing them, applied to SPAM detection in Polish text that contains English phrases.


text document representation; term weighting functions; TF-IDF; reduction of text document representation; classification; SPAM detection


