Least Squares Clustering
________________________________________________________________________
Project outline and aims
Clustering is an important part of data mining which
aims at finding coherent chunks of data in databases and data tables. This
projects intends to develop a mathematical theory for clustering based
on the following idea. Clustering is considered to approximate data by
a simpler, cluster-wise structure rather than to reveal the geometrically
explicit ``coherent clusters'' in a data point-set. The least squares is
just a criterion for finding approximate cluster-wise structures. The results
found within this, data recovery, approach amount to a mathematical theory for clustering
involving the following directions of development: (a) unifying a considerable
part of the clustering techniques, (b) developing new techniques, (c) finding
relations among various notions and algorithms both within the clustering
discipline and outside -- especially in statistics, machine learning and
combinatorial optimization.
The
unifying capability of the data recovery clustering is grounded on convenient
relations which exist between approximation problems and geometrically
explicit clustering. Based on this, the major clustering techniques have
been reformulated as locally optimal approximation algorithms and extended
to many situations untreatable with explicit approaches such as mixed- scale data
clustering. Firm mathematical relations have been found between traditional
and conceptual clustering; moreover, unexpectedly, some classical statistical
concepts such as contingency measures have been found to have meaning in
the approximation framework. These yield a set of simple but efficient
interpretation and visualization tools. Several new methods have been developed in the framework,
such as additive and principal cluster analyses, uniform partitioning,
box clustering, and fuzzy additive type clustering. In a few cases, the data recovery
clustering goes into substantive phenomena modeling, as in the case of
aggregating mobility tables.
Most recent applications:
evolutionary trees; see
B. Mirkin, E. Koonin (2003)
A top-down method for building genome classification
trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe,
F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series,
Providence: AMS (to appear).
text analysis; see B. Mirkin (2002)
Towards comprehensive clustering of mixed scale data with K-Means,
Proceedings of the Workshop on Computational
Intelligence, Birmingham, 2-4 September, 2002, 126-130.
________________________________________________________________________________
Project publications
The data recovery approach is thoroughly described in my latest book,
Clustering for Data Mining: A Data Recovery Approach, Chapman \& Hall/CRC2005.
Other
publications:
sequential fitting approach to data analysis; see B. Mirkin (1997)
Approximation
Clustering: a Mine of Semidefinite Programming Problems, in P.Pardalos
and H.Wolkowich (Eds.) Topics in Semidefinite and Interior-Point Methods,
Fields Institute Communications Series, AMS:Providence, 167-180, and B.
Mirkin (1998) Least-Squares
Structuring, Clustering, and Data Processing Issues, The Computer Journal,
1998, 41, no. 8, 519-536.
separate-and-conquer clustering; see B. Mirkin (1999) Concept
Learning and Feature Selection Based on Square-Error Clustering, Machine
Learning, 35, 25-40.
linear embedding of hierarchies; see B. Mirkin (1997) Linear
embedding of binary hierarchies and its applications, in B. Mirkin,
F. McMorris, F. Roberts, and A. Rzhetsky (Eds.) Mathematical Hierarchies
and Biology, DIMACS Series in Discrete Mathematics and Theoretical Computer
Science, V. 37, AMS: Providence, 331-356;
B. Mirkin, E. Koonin (2003)
A top-down method for building genome classification
trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe,
F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series,
Providence: AMS (to appear).
fuzzy clustering with proportional membership; see Nascimento, S., Mirkin,
B. and Moura-Pires, F.
(2000) A
fuzzy clustering model of data with proportional membership, in T.
Whalen (Ed.) 19th International Conference of North American Fuzzy Information
Processing Society, Piscataway, NJ: IEEE, 261-266.
box-clustering in rectangular and contingency tables; see B. Mirkin, P.
Arabie, and L. Hubert (1995)
Additive
two-mode clustering: the error-variance approach revisited, Journal
of Classification, 12, 243-263; B. Mirkin (1996) Clustering for contingency
tables: boxes and partitions, Statistics and Computing, 6, 217-229.
aggregation of contingency tables; see B. Mirkin (1999)
Three
Approaches to Aggregation of Interaction Tables, in H. Bacelar-Nicolau,
F. Costa Nicolau and J. Janssen (Eds.) Applied Stochastic Models and Data
Analysis, Lisbon: National Institute of Statistics, 30-35.
structural clustering and layered clusters; see Y. Kempner, B. Mirkin and
I. Muchnik (1997) Monotone linkage
clustering and quasi-convex set functions, Appl. Math. Letters, 10,
no. 4, 19-24;
B. Mirkin and I. Muchnik (2002)
Layered Clusters of Tightness Set Functions, Applied Mathematics Letters,
15, 147-151, and Induced Layered
Clusters, Hereditary Mappings, and Convex Geometries, Applied Mathematics
Letters, 15, 293-298.
comprehensive clustering; see B. Mirkin (2002)
Towards comprehensive clustering
of mixed-scale data with K-Means, Proceedings of the 2002 UK Workshop on
Computational Intelligence (Birmingham, September, 2002), 126-130.
clustering and association in contingency tables; see B. Mirkin (2001)
Eleven
Ways to Look at the Chi-Squared Coefficient for Contingency Tables,
The American Statistician, 55, no. 2, 111-120, and B. Mirkin (2001)
Reinterpreting
the Category Utility Function, Machine Learning, 45, 219-228.