Cluster analysis and dimensionality reduction in a hierarchical corpus model

Jan Wicijowski, Bartosz Ziółko


The article presents a semantic model of the Polish language built from Polish Wikipedia texts. The model is part of an automatic speech recognition system, where it verifies sentence hypotheses. Methods of filtering and clustering the documents, which aim to accelerate the computations, are presented. The authors emphasize delegating processing tasks to the database engine wherever doing so improves performance.


cluster analysis; vector space model; document-term matrix; sparse matrix; sqlite3; Wikipedia; mediawiki
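
The idea of delegating processing to the database engine can be illustrated with a minimal sketch. It is not the authors' implementation: the table name `tokens`, the function `build_term_counts`, the whitespace tokenizer, and the `min_df` document-frequency filter are all hypothetical choices made for this example. The sketch stores one row per token in sqlite3 and lets SQL compute the non-zero entries of a document-term matrix, instead of counting in Python.

```python
import sqlite3


def build_term_counts(docs, min_df=2):
    """Return (doc_id, term, tf) rows for terms found in at least min_df documents.

    docs is a dict mapping doc_id -> text. The schema and the filter
    threshold are assumptions made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (doc_id INTEGER, term TEXT)")
    # One row per token occurrence; tokenization is a naive whitespace split.
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?)",
        (
            (doc_id, tok)
            for doc_id, text in docs.items()
            for tok in text.lower().split()
        ),
    )
    # Both the filtering (document frequency) and the counting (term
    # frequency) are done by the database engine in a single query.
    rows = conn.execute(
        """
        SELECT doc_id, term, COUNT(*) AS tf
        FROM tokens
        WHERE term IN (
            SELECT term FROM tokens
            GROUP BY term
            HAVING COUNT(DISTINCT doc_id) >= ?
        )
        GROUP BY doc_id, term
        ORDER BY doc_id, term
        """,
        (min_df,),
    ).fetchall()
    conn.close()
    return rows


docs = {1: "ala ma kota", 2: "kot ma ale", 3: "ala ma psa"}
rows = build_term_counts(docs, min_df=2)
```

Each returned row is one non-zero cell of a sparse document-term matrix, so the result could be loaded directly into a coordinate-format sparse structure.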

Full Text:

PDF (Polish)

