Choosing a persistent storage for data mining task

Paweł Kasprowski


The amount of data available for mining and machine learning keeps growing. One of the main problems in data mining today is therefore deciding how to store that data persistently so that mining algorithms can load and save it easily and quickly. When the data is too big to fit in memory, there are two common ways to handle it: a text or binary file in a custom format, or a ready-to-use general-purpose database engine. Both have advantages and disadvantages. Among database engines, the most popular choice is a relational database server. Recently, non-relational databases such as document-oriented databases have become another promising option. The work presented in this paper analyses how different storage options behave for large amounts of data. The experiments compare the efficiency of these storage options for some classic mining tasks.


data mining; persistent storage; document-oriented database



