As some of you know, one of our major long-term goals for SenseClusters
is to choose the number of clusters automatically, rather than
requiring that the user pre-specify the desired number of clusters. This
is a hard problem, and there are many suggested ways of approaching it.

One method we are actively looking at is called the GAP Statistic
(Tibshirani et al., 2001), which is a resampling method that compares
clusters to a reference distribution in order to determine the "best"
clustering. It has the disadvantage of being rather complex
computationally, and it's not clear (to me at least) how one goes about
choosing a reasonable reference distribution given the nature of our data.
But, it's still quite interesting and might be a great way to handle this
problem.

There is also a fairly long tradition of work that looks at the
characteristics of the clustering as it proceeds, and does a test to see
if a stopping criterion has been reached. These are sometimes called
"stopping rules", and they have the nice characteristic of being
relatively simple computationally, and also somewhat intuitive in their
formulation. More or less, you start clustering, you take these
measurements at each step, and you stop when you reach an appropriate
number of clusters (according to the measure or stopping rule).

Here is a quick summary of a few stopping rules that I have been looking
at of late. Take these descriptions with a small grain of salt; I may not
have completely understood them quite yet, but I wanted to get them out
into "public discussion" and see if anyone has strong feelings one way or
the other about which of these techniques is most appropriate.

=========================================
CH stopping rule

At each step in the clustering, let k designate the number of clusters
at that step, and n the total number of observations to be clustered...

Compute the sum of squares between the k clusters (how far apart the
clusters are from one another) and the sum of squares within the k
clusters (how compact each cluster is internally).

B(k) - between cluster sum of squares
W(k) - within cluster sum of squares

        B(k)/(k-1)
CH(k) = ----------
        W(k)/(n-k)

This is a ratio of the between-cluster to within-cluster distance at k
clusters, so a large value means k clusters is "good", while a low value
means k clusters is not so good. One can then set a threshold for CH(k)
and stop the clustering when that value is reached. Or, perhaps one could
"plot" CH(k) and look for a sudden change in this ratio as k decreases.
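To make the computation concrete, here is a minimal sketch in Python (not
SenseClusters code; the function names and the toy 2-d points are purely
illustrative) of computing CH(k) for one given partition of the data:

```python
# Hypothetical helper: compute the CH index for a clustering given as a
# list of clusters, where each cluster is a list of d-dimensional tuples.

def mean(points):
    # Component-wise mean (centroid) of a list of points.
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ch_index(clusters):
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    grand = mean(all_points)
    # W(k): within-cluster sum of squares (each point to its centroid)
    W = sum(sq_dist(p, mean(c)) for c in clusters for p in c)
    # B(k): between-cluster sum of squares (centroids to grand centroid,
    # weighted by cluster size)
    B = sum(len(c) * sq_dist(mean(c), grand) for c in clusters)
    return (B / (k - 1)) / (W / (n - k))
```

With two well-separated clusters the ratio is large; reassigning the same
points so that each cluster mixes the two groups drives it toward zero.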

Proposed by:

Calinski and Harabasz (1974)
A Dendrite Method for Cluster Analysis
Communications in Statistics, 3(1), 1974, 1-27

Recommended by (after comparative study of 30 stopping rules!):

Milligan and Cooper, 1985
An Examination of Procedures for Determining the Number of Clusters
in a Data Set, Psychometrika, 50(2), 1985, 159-179

============================================================

Hartigan stopping rule

Similar to CH(k), except it is based on internal distances only, and
it compares the current clustering (k clusters) with the clustering
from the previous step (k+1 clusters).

          W(k)
H(k) = ( ------ - 1 ) * (n - k - 1)
         W(k+1)

Stop when H(k) < some threshold; 10 is sometimes suggested as a
good value, but this will vary depending on the data.
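Assuming we already have W(k) for a range of k (however the clustering
produced them), the rule itself is just a loop; `hartigan_k` below is a
hypothetical helper name, not anything in SenseClusters:

```python
def hartigan_k(W, n, threshold=10):
    """Return the first k where H(k) falls below the threshold.
    W is a dict mapping k -> W(k), the within-cluster sum of squares."""
    for k in sorted(W):
        if k + 1 not in W:
            break  # need W(k+1) to evaluate H(k)
        H = (W[k] / W[k + 1] - 1) * (n - k - 1)
        if H < threshold:
            return k
    return max(W)  # no k passed the test; fall back to the largest tried
```

For example, with W = {1: 100.0, 2: 20.0, 3: 18.0, 4: 17.0} and n = 50,
H(1) is large (going to 2 clusters helps a lot) but H(2) is small, so the
rule settles on k = 2.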

Hartigan, J., 1975, Clustering algorithms (New York: John Wiley & Sons).
=========================================
Mojena stopping rule

Let a1, a2,..., a(N-1) be the distances at which the clusters merge,
where N is the number of observations in the data.

Let A be the average of these, and let s be the standard deviation.

Look at the successive values of

z1 = (a1-A)/s
z2 = (a2-A)/s
etc.

and "break" the clustering at the first point where zi is greater
than some threshold k. (This is not the same k as we use for the
number of clusters). Commonly recommended values of k are 1.25 or
2, but this will depend on your data.
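A minimal sketch of the Mojena rule, assuming we have the merge heights
from an agglomerative clustering in order (`mojena_break` is a hypothetical
name for illustration):

```python
import statistics

def mojena_break(merge_heights, threshold=1.25):
    """Return the 1-based index of the first merge whose standardized
    height z_i = (a_i - A)/s exceeds the threshold, or None if none does."""
    A = statistics.mean(merge_heights)
    s = statistics.stdev(merge_heights)  # sample standard deviation
    for i, a in enumerate(merge_heights, start=1):
        if (a - A) / s > threshold:
            return i
    return None
```

A sudden jump in merge height (e.g., a run of small merges followed by a
much larger one) is what pushes z_i over the threshold and "breaks" the
clustering at that point.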

Mojena, R (1977) Hierarchical grouping methods and stopping rules: an
evaluation, Computer Journal, v. 20, 353-363.

=========================================
Silhouette value

Each observation is assigned a value known as its silhouette.

        bj - aj
Sj = -----------
     max(aj, bj)

where aj is the average dissimilarity of observation j to the members
of its own cluster, and bj is its average dissimilarity to the members
of its closest neighboring cluster. Note that -1 <= Sj <= 1. A value
near 1 means that the observation is "well clustered" (very similar to
the other members of its cluster, much less so to the neighboring
cluster), while a value near -1 means that the observation is probably
misclassified.

The average silhouette value over all the observations in a cluster
measures the overall goodness of that cluster, and the average over
all observations in the data measures the goodness of the clustering
as a whole. The number of clusters that maximizes this overall average
silhouette is taken as the optimal number of clusters.
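Here is a minimal sketch of the per-observation silhouettes, using plain
Euclidean distance for illustration (our data would of course use its own
dissimilarity measure, and this sketch does not handle singleton clusters):

```python
import math

def silhouettes(points, labels):
    """Silhouette value for each point. points is a list of tuples,
    labels a parallel list of cluster ids (each cluster needs >= 2 members)."""
    out = []
    clusters = set(labels)
    for j, p in enumerate(points):
        # aj: average distance to the other members of j's own cluster
        own = [q for i, q in enumerate(points)
               if labels[i] == labels[j] and i != j]
        a = sum(math.dist(p, q) for q in own) / len(own)
        # bj: smallest average distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for i, q in enumerate(points)
                if labels[i] == c) / labels.count(c)
            for c in clusters if c != labels[j])
        out.append((b - a) / max(a, b))
    return out
```

Averaging these values over all points for each candidate k, and taking
the k with the largest average, gives the silhouette-based choice of k.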

Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An
Introduction to Cluster Analysis. Wiley, New York.

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis. J. Comput. Appl. Math., 20, 53-65.

===============================================================

Summarizing comments: these methods all strike me as somewhat similar in
spirit, with a few differences between them of course. They tend to try
to find the points at which the clustering "changes", either by comparing
internal and external distances (CH) or by comparing the current and
previous clustering steps (Silhouette, Mojena, Hartigan). As such, I
think these represent nice starting points for trying to figure out how
to stop clustering - they capture a certain intuition about the problem
that makes sense (i.e., we should stop clustering when the clusters we
are finding aren't very different from each other), and they seem rather
simple computationally, especially CH, Mojena, and Hartigan. Note that CH
came out well in a very extensive comparative study, so maybe that should
give it a slightly higher priority in our investigations.

One might wonder about the randomness associated with clustering - in
other words, at various points clustering algorithms have equally
good choices to make, and they choose randomly. This means results
can vary from run to run for the same data, and could in theory
result in different k values. One reasonable way to handle this is
to simply run the clustering M times and find the best value of k
in each of M cases - hopefully k will be fairly consistent across
these multiple runs, and if it is we can use that value. If it is
wildly inconsistent, it might suggest that we need more data or
more features before figuring out a value of k.
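The multiple-runs idea can be sketched in a few lines; `consensus_k` and
`choose_k` are hypothetical names, and the clustering-plus-stopping-rule
step is assumed to be supplied by the caller:

```python
from collections import Counter
import random

def consensus_k(choose_k, data, M=10, seed=0):
    """Run a (possibly randomized) clustering M times and return the most
    common best-k along with the fraction of runs that agreed on it.
    choose_k(data, rng) is assumed to run the clustering with the given
    random source and return the k picked by some stopping rule."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    ks = [choose_k(data, rng) for _ in range(M)]
    k, count = Counter(ks).most_common(1)[0]
    return k, count / M
```

A low agreement fraction across the M runs would be exactly the "wildly
inconsistent" signal described above.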

Please feel free to comment on any of the above, and also point out
any errors that might exist in my interpretations. If there are other
measures that might be good for stopping rules, please let us
know!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
