Least Squares Clustering

________________________________________________________________________


Project outline and aims

Clustering is an important part of data mining that aims at finding and describing coherent chunks of data in databases and data tables. This project intends to develop a mathematical theory for clustering based on the idea that the discovered clusters should be treated as an "ideal" representation of the data, to be used for recovering the original data from the ideal format. This is the idea of the data recovery approach: not only use data for finding clusters, but also use clusters for recovering the data. In general, the data recovered from aggregate clusters cannot fit the original data exactly, which can be used for evaluating the quality of the clusters: the better the fit, the better the clusters. This perspective also leads to addressing issues in pre- and post-processing, which becomes possible because the parts of the data that are well explained by the clusters can be separated from those that are not.

This data recovery approach works due to the Pythagorean decomposition of the data scatter into "explained" and "unexplained" parts, thus extending to clustering what works well in more traditional data mining and statistics areas such as regression, analysis of variance and factor analysis.
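The decomposition can be illustrated with K-Means: once the centroids are the means of their clusters, the total data scatter splits exactly into a between-cluster ("explained") part and a within-cluster ("unexplained") part. A minimal sketch in Python; the synthetic data and the plain Lloyd-style K-Means below are illustrative assumptions, not taken from the publications above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative synthetic data: three Gaussian blobs (an assumption for the demo)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2))
               for m in ((0, 0), (5, 0), (2, 4))])

def kmeans(X, k, iters=100):
    """Plain Lloyd-style K-Means minimising the least-squares criterion."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each entity to its nearest centre
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        # Update each centre to the mean of its cluster (keep it if empty)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        done = np.allclose(new, centers)
        centers = new
        if done:
            break
    return labels, centers

labels, centers = kmeans(X, 3)

grand_mean = X.mean(0)
T = ((X - grand_mean) ** 2).sum()          # total data scatter
W = ((X - centers[labels]) ** 2).sum()     # unexplained: within-cluster scatter
# Explained, between-cluster part: sum over clusters of n_k * ||c_k - mean||^2
B = ((centers - grand_mean) ** 2).sum(1) @ np.bincount(labels, minlength=3)

print(np.isclose(T, B + W))  # prints True: the Pythagorean decomposition holds
```

The cross terms vanish because each centre is the mean of its cluster, so the identity T = B + W is exact, not approximate; the ratio B/T then measures how well the clusters "explain" the data.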

The results found within this approach amount to a mathematical theory for clustering involving: (a) unification of a considerable part of clustering techniques, (b) effective techniques for tackling old and new problems, and (c) firm relations among various notions and algorithms in statistics, machine learning and combinatorial optimization. The unifying capability of data recovery clustering is grounded in convenient relations between approximation problems and geometrically explicit clustering.

The approach is described in

  • my latest book Clustering for Data Mining: A Data Recovery Approach (2005), Chapman and Hall/CRC; its table of contents and introduction can be found here. The book is presented on the publisher's web site, Chapman & Hall/CRC, and reviewed in Biomedical Engineering Online.
  • an older monograph, Mathematical Classification and Clustering (1996), which provides a vast coverage of the field. Its publisher's web site disappeared along with the publisher, Kluwer Academic Publishers.

    Recent invited presentations

  • Different Perspectives at Clustering: The Number-of-Clusters Case (International Federation of Classification Societies, Ljubljana, Slovenia, July 2006)
  • Clustering: Tackling Challenges with Data Recovery Approach (UK Workshop on Computational Intelligence, Leeds, UK, September 2006)
  • L2 and L1 Criteria for K-Means Bilinear Clustering (Workshop on Data Mining and Mathematical Programming, Montreal, Canada, October 2006)

    Recent developments

  • PhD project by Susana Nascimento (2000-2002): Fuzzy clustering using proportional membership.

    Some publications: S. Nascimento, B. Mirkin, and F. Moura-Pires (2003) Modeling proportional membership in fuzzy clustering, IEEE Transactions on Fuzzy Systems, 11, no. 2, 173-186; S. Nascimento (2005) Fuzzy Clustering Via Proportional Membership Model, IOS Press, 178 p.

  • PhD project by Ito Wasito (2000-2003): Least Squares algorithms with NN techniques for imputing missing data.

    Some publications: I. Wasito, B. Mirkin (2005) Nearest neighbour approach in the least squares data imputation algorithms, Information Sciences, 169, 1-25; I. Wasito, B. Mirkin (2006) Nearest neighbours in least-squares data imputation algorithms with different missing patterns, Computational Statistics & Data Analysis 50, 926-949.

  • PhD project by Robert Stanforth (2005-2008): Intelligent K-Means clustering for analysis of Quantitative Structure Activity Relationships (QSAR).

    Some papers: Robert W. Stanforth, Evgueni Kolossov, Boris Mirkin (2007) A Measure of domain of applicability for QSAR modelling based on intelligent K-Means clustering, QSAR and Combinatorial Science, 26, no. 7, 837-844; Robert W. Stanforth, Evgueni Kolossov, Boris Mirkin (2007) Hybrid K-Means: Combining regression-wise and centroid-based criteria for QSAR (in P. Brito, P. Bertrand, G. Cucumel, F. de Carvalho (Eds.) Selected Contributions in Data Analysis and Classification, Springer, 225-233).

  • PhD project by Mark MingTso Chiang (2005-2009): Intelligent K-Means with square error and absolute error criteria: experimentation and application.

    Some papers: Mark MingTso Chiang and B. Mirkin (2006) Determining the number of clusters in the Straight K-means: Experimental comparison of eight options, UKCI 2006 Proceedings, 119-126; Mark MingTso Chiang and B. Mirkin (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads, Journal of Classification, 27, no. 1, 1-38.
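The "intelligent K-Means" named in the projects above determines the number of clusters by iteratively extracting Anomalous Pattern clusters before a final K-Means run. A rough sketch of that initialisation step, assuming the data are pre-standardised and the grand mean serves as the fixed reference point; the data, the `min_size` parameter and the function names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def anomalous_pattern(X, ref):
    """Extract one Anomalous Pattern cluster: seed with the entity farthest
    from the reference point, then alternate assignment (entities closer to
    the tentative centre than to the reference) and centre update."""
    c = X[((X - ref) ** 2).sum(1).argmax()]
    while True:
        mask = ((X - c) ** 2).sum(1) < ((X - ref) ** 2).sum(1)
        new_c = X[mask].mean(0)
        if np.allclose(new_c, c):
            return mask
        c = new_c

def ik_means_init(X, min_size=2):
    """Peel off AP clusters one by one; clusters of at least min_size
    entities yield K and the initial centres for a final K-Means run."""
    ref = X.mean(0)                    # reference point stays fixed throughout
    remaining = np.arange(len(X))
    centres = []
    while remaining.size:
        mask = anomalous_pattern(X[remaining], ref)
        if mask.sum() >= min_size:     # discard tiny AP clusters
            centres.append(X[remaining][mask].mean(0))
        remaining = remaining[~mask]   # remove the extracted cluster
    return np.array(centres)

rng = np.random.default_rng(1)
# Illustrative data: three well-separated blobs (an assumption for the demo)
X = np.vstack([rng.normal(m, 0.3, size=(40, 2))
               for m in ((0, 0), (8, 0), (4, 7))])
print(len(ik_means_init(X)))  # prints 3: each blob emerges as one AP cluster
```

Each pass is guaranteed to remove at least the seed entity, so the peeling terminates; the surviving centres give both K and the starting centroids, which is what the experiments in the Chiang and Mirkin papers compare against other number-of-clusters options.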

  • Clustering for evolutionary reconstructions (2001-)

    Some papers: B. Mirkin, E. Koonin (2003) A top-down method for building genome classification trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe, F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series, Providence: AMS, 97-112; B. Mirkin, R. Camargo, T. Fenner, G. Loizou and P. Kellam (2010) Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus, Theoretical Chemistry Accounts, 125, no. 3-6, 569-581.

  • Clustering of mixed scale data (2000-)

    Some publications: B. Mirkin (2001) Eleven Ways to Look at the Chi-Squared Coefficient for Contingency Tables, The American Statistician, 55, no. 2, 111-120; B. Mirkin (2001) Reinterpreting the Category Utility Function, Machine Learning, 45, 219-228; B. Mirkin (2002) Towards comprehensive clustering of mixed scale data with K-Means, Proceedings of the UK Workshop on Computational Intelligence, Birmingham, 2-4 September, 2002, 126-130.

  • Clustering for representing organisations over ontologies (2006-)

    B. Mirkin, S. Nascimento, L. Muniz Pereira (2007) How ACM classification can be used for profiling a University CS department, a PowerPoint presentation; ACM classification can be used for representing research organizations, DIMACS Technical Report, 2007-13, 18 p.
    ________________________________________________________________________________

    Topics for student projects

    Building a classification tree for mixed data scales (extending some ideas from the paper by Mirkin (2001) in Machine Learning).

    Stratification of entities over multiple performance metrics.

    Clustering text documents (extending some ideas from the paper by Pampapathi, Mirkin and Levene (2006) in Machine Learning).