Least Squares Clustering

________________________________________________________________________


Project outline and aims

Clustering is an important part of data mining which aims at finding coherent chunks of data in databases and data tables. This projects intends to develop a mathematical theory for clustering based on the following idea. Clustering is considered to approximate data by a simpler, cluster-wise structure rather than to reveal the geometrically explicit ``coherent clusters'' in a data point-set. The least squares is just a criterion for finding approximate cluster-wise structures. The results found within this, data recovery, approach amount to a mathematical theory for clustering involving the following directions of development: (a) unifying a considerable part of the clustering techniques, (b) developing new techniques, (c) finding relations among various notions and algorithms both within the clustering discipline and outside -- especially in statistics, machine learning and combinatorial optimization.

The unifying capability of the data recovery clustering is grounded on convenient relations which exist between approximation problems and geometrically explicit clustering. Based on this, the major clustering techniques have been reformulated as locally optimal approximation algorithms and extended to many situations untreatable with explicit approaches such as mixed- scale data clustering. Firm mathematical relations have been found between traditional and conceptual clustering; moreover, unexpectedly, some classical statistical concepts such as contingency measures have been found to have meaning in the approximation framework. These yield a set of simple but efficient interpretation and visualization tools. Several new methods have been developed in the framework, such as additive and principal cluster analyses, uniform partitioning, box clustering, and fuzzy additive type clustering. In a few cases, the data recovery clustering goes into substantive phenomena modeling, as in the case of aggregating mobility tables.

Most recent applications:

  • evolutionary trees; see B. Mirkin, E. Koonin (2003) A top-down method for building genome classification trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe, F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series, Providence: AMS (to appear).
  • text analysis; see B. Mirkin (2002) Towards comprehensive clustering of mixed scale data with K-Means, Proceedings of the Workshop on Computational Intelligence, Birmingham, 2-4 September, 2002, 126-130.
    ________________________________________________________________________________

    Project publications

    The data recovery approach is thoroughly described in my latest book,   Clustering for Data Mining: A Data Recovery Approach, Chapman \& Hall/CRC2005.

    Other publications:
  • sequential fitting approach to data analysis; see B. Mirkin (1997) Approximation Clustering: a Mine of Semidefinite Programming Problems, in P.Pardalos and H.Wolkowich (Eds.) Topics in Semidefinite and Interior-Point Methods, Fields Institute Communications Series, AMS:Providence, 167-180, and B. Mirkin (1998) Least-Squares Structuring, Clustering, and Data Processing Issues, The Computer Journal, 1998, 41, no. 8, 519-536.
  • separate-and-conquer clustering; see B. Mirkin (1999) Concept Learning and Feature Selection Based on Square-Error Clustering, Machine Learning, 35, 25-40.
  • linear embedding of hierarchies; see B. Mirkin (1997) Linear embedding of binary hierarchies and its applications, in B. Mirkin, F. McMorris, F. Roberts, and A. Rzhetsky (Eds.) Mathematical Hierarchies and Biology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, V. 37, AMS: Providence, 331-356; B. Mirkin, E. Koonin (2003) A top-down method for building genome classification trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe, F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series, Providence: AMS (to appear).
  • fuzzy clustering with proportional membership; see Nascimento, S., Mirkin, B. and Moura-Pires, F. (2000) A fuzzy clustering model of data with proportional membership, in T. Whalen (Ed.) 19th International Conference of North American Fuzzy Information Processing Society, Piscataway, NJ: IEEE, 261-266.
  • box-clustering in rectangular and contingency tables; see B. Mirkin, P. Arabie, and L. Hubert (1995) Additive two-mode clustering: the error-variance approach revisited, Journal of Classification, 12, 243-263; B. Mirkin (1996) Clustering for contingency tables: boxes and partitions, Statistics and Computing, 6, 217-229.
  • aggregation of contingency tables; see B. Mirkin (1999) Three Approaches to Aggregation of Interaction Tables, in H. Bacelar-Nicolau, F. Costa Nicolau and J. Janssen (Eds.) Applied Stochastic Models and Data Analysis, Lisbon: National Institute of Statistics, 30-35.
  • structural clustering and layered clusters; see Y. Kempner, B. Mirkin and I. Muchnik (1997) Monotone linkage clustering and quasi-convex set functions, Appl. Math. Letters, 10, no. 4, 19-24; B. Mirkin and I. Muchnik (2002) Layered Clusters of Tightness Set Functions, Applied Mathematics Letters, 15, 147-151, and Induced Layered Clusters, Hereditary Mappings, and Convex Geometries, Applied Mathematics Letters, 15, 293-298.
  • comprehensive clustering; see B. Mirkin (2002) Towards comprehensive clustering of mixed-scale data with K-Means, Proceedings of the 2002 UK Workshop on Computational Intelligence (Birmingham, September, 2002), 126-130.
  • clustering and association in contingency tables; see B. Mirkin (2001) Eleven Ways to Look at the Chi-Squared Coefficient for Contingency Tables, The American Statistician, 55, no. 2, 111-120, and B. Mirkin (2001) Reinterpreting the Category Utility Function, Machine Learning, 45, 219-228.

  •