Cluster analysis and dimensionality reduction in a hierarchical corpus model

Jan Wicijowski, Bartosz Ziółko


The article presents a semantic model of the Polish language built from the texts of the Polish Wikipedia. The model is part of an automatic speech recognition system, where it verifies sentence hypotheses. Methods for filtering and clustering the documents, aimed at accelerating the computations, are presented. The authors emphasize delegating processing tasks to the database engine wherever this improves performance.


cluster analysis; vector space model; document-term matrix; sparse matrix; sqlite3; Wikipedia; MediaWiki

