The importance of selection of metrics in the analysis of separation between clusters

Łukasz Paśko, Galina Setlak

Abstract


The aim of this paper is to examine how the choice of distance metric affects the analysis of separation between clusters of objects in a feature space. Fourteen metrics known from the literature were selected for the calculations. Seven datasets differing in the number of objects, attributes, and clusters were examined, and four cluster separation measures were calculated for each of them. The article presents selected results, with particular emphasis on the differences arising from the use of different metrics.
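
As a rough illustration of why the choice of metric matters, the sketch below computes one simple separation value, the smallest distance between objects of two clusters, under three metrics. The abstract does not name the fourteen metrics or the four separation measures, so the Euclidean, Manhattan, and Chebyshev distances, the minimum inter-cluster distance, and the toy clusters are all assumptions chosen for this example and are not taken from the paper.

```python
from itertools import product


def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    # Sum of absolute coordinate differences (taxicab distance).
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


def min_separation(cluster_a, cluster_b, metric):
    # Illustrative separation value (an assumption, not the paper's measure):
    # the smallest distance between an object of one cluster
    # and an object of the other.
    return min(metric(p, q) for p, q in product(cluster_a, cluster_b))


# Hypothetical toy clusters in a two-dimensional feature space.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 4.0), (5.0, 5.0)]
cluster_c = [(6.0, 0.0), (7.0, 0.0)]

metrics = {"euclidean": euclidean, "manhattan": manhattan, "chebyshev": chebyshev}

# For each metric: (separation of A from B, separation of A from C).
results = {
    name: (
        min_separation(cluster_a, cluster_b, metric),
        min_separation(cluster_a, cluster_c, metric),
    )
    for name, metric in metrics.items()
}

if __name__ == "__main__":
    for name, (a_to_b, a_to_c) in results.items():
        print(f"{name}: A-B = {a_to_b}, A-C = {a_to_c}")
```

On this toy data the Euclidean distance places clusters B and C equally far from A, the Manhattan distance makes C the closer cluster, and the Chebyshev distance makes B the closer one, so even the ranking of cluster separations can change with the metric.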

Keywords


separation of clusters; metrics; measures of the quality of clustering

Full Text:

PDF (Polish)

DOI: http://dx.doi.org/10.21936/si2016_v37.n1.753