Clustering collections of XML documents having different structure types

Michał Kozielski


This paper compares several clustering algorithms and XML structure encoding methods applied to clustering XML documents with different structure types. Clustering quality is evaluated by how well the resulting partitions accelerate the execution of selective queries on XML collections. The results show that, for XML documents with complex structure, a multilevel clustering algorithm yields partitions of better quality.


clustering; XML documents clustering
