Now we have finally moved on to agglomerative clustering and its associated criterion functions, which are, generally speaking, the so-called "link" family of functions. (That's what I call them, at least. :) Remember, of course, that our objective in this discussion is to explore the possibilities of automatically stopping clustering using Cluto.
In partitional methods, we must run the clustering algorithm for each value of k, collect the resulting criterion function value at each k, and then look at the trend of those values as k increases. We have not yet decided how to identify the appropriate k at which to stop the clustering, but all of the information we need to make such a decision is likely contained in those criterion function values. We've now seen examples of that using a variety of the partitional criterion functions.

With agglomerative clustering things are potentially a little different. In general, agglomerative clustering works in a series of steps whereby it goes from having each item in its own cluster down to the case where all items are in one cluster. Hierarchical clustering can also proceed in the other direction, starting with everything in one cluster and then dividing it up. The former (many to 1) is agglomerative, and the latter (1 to many) is divisive. So, in fact, we can look at the value of the criterion function that is being optimized to make the merging decisions during agglomerative clustering, and use that as a means of finding a stopping point. Or we could evaluate the clusters found at each step using one of our partitional criterion functions to measure the quality of the clustering, and use that to determine the appropriate value of k. There are lots of interesting possibilities here.

From the brief description above it should be clear that agglomerative methods build clusters gradually, adding members one context at a time. This is unlike the partitional methods, which chop up the contexts into clusters very quickly. As such, the criterion functions for agglomerative clustering are designed to measure the similarity between pairs of clusters. So here is the basic idea behind the link family of criterion functions.
You have two clusters and you are trying to figure out how similar they are to each other. You measure all of the pairwise similarities between the items in one cluster and the items in the other, and then the similarity between the two clusters is assessed as follows:

- For single link (-crfun slink), the similarity between the two clusters is the maximum similarity (minimum distance) between any pair of items drawn from the two clusters. In other words, the similarity between two clusters is that of the two nearest neighbors across the clusters. Note that slink has a tendency to "chain", that is, to create long "narrow" clusters (since you are adding nearest neighbors at each step).

- For complete link (-crfun clink), the similarity between the two clusters is the minimum similarity (maximum distance) between any pair of items drawn from the two clusters. So the similarity between the two clusters is defined by the furthest neighbors. clink is known to create rounder, more globe-like clusters, since you are merging the clusters whose most distant neighbors are most similar.

- For Unweighted Pair Group Method with Arithmetic mean (-crfun upgma), the similarity between two clusters is the average of all the pairwise similarities between the items in the two clusters. This is comparable to average link clustering, which is in turn comparable to McQuitty's Similarity Analysis, which we have used in previous work.

Now, these methods cannot be used with partitional clustering, because they measure the similarity between a pair of clusters. At each step in the clustering, they must measure the pairwise similarities between all pairs of clusters, and then pick the two most similar clusters to merge (assuming that we are working agglomeratively). The partitional algorithms, by contrast, optimize the criterion function on all of the data (direct) or some portion of it (rb and rbr), and don't need to do the exhaustive pairwise computations that the link measures do.
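To make the three link criteria concrete, here is a minimal sketch in Python. These are illustrative re-implementations of the definitions above, not Cluto's actual code, and the item-level similarity function sim is an assumption (any function returning higher values for more similar items would do):

```python
# Sketch of the three "link" criteria, assuming sim(a, b) gives the
# similarity between two individual items (higher = more similar).

def slink(c1, c2, sim):
    """Single link: similarity of the two *nearest* cross-cluster items."""
    return max(sim(a, b) for a in c1 for b in c2)

def clink(c1, c2, sim):
    """Complete link: similarity of the two *furthest* cross-cluster items."""
    return min(sim(a, b) for a in c1 for b in c2)

def upgma(c1, c2, sim):
    """UPGMA: average of all cross-cluster pairwise similarities."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

Note how slink is dominated by the single closest pair, which is exactly why it chains, while clink is dominated by the single worst pair, which keeps clusters compact.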
Now, what is interesting is that the agglomerative algorithms can use the partitional criterion functions! The reverse is not true (as described in the previous paragraph): the partitional algorithms cannot use the link criterion functions. How can this be, you ask? Well, think about it for a moment. The agglomerative method is essentially checking pairs of clusters at each step, and finding the pair that is most similar in order to merge them. While you can make that judgement using complete link (clink), you could just as well apply H2 (for example) to each candidate pair of clusters and find out which pair has the best H2 score. While this is possible in theory, I am not sure whether it is a great idea in practice. The measures H1, H2, I1, I2, and E1 have certainly been designed with partitional methods in mind, and single link, complete link, and upgma have certainly been designed with agglomerative clustering in mind. So you should probably hold off on these more creative (yet perfectly reasonable) combinations until you have a great deal of experience with the different methods.

Next we'll take a look at some agglomerative output, and see what we can make of it. Of course, agglomerative methods are very tempting with respect to cluster stopping rules, because they have a natural iterative process where you are evaluating criterion functions at every step, and so the idea is to stop the clustering at just the right moment. Stopping on the fly is a bit overambitious, though, so in all likelihood the correct approach is to take the clustering from N down to 1, look back at the clustering decisions retrospectively to see at which point you should have stopped, and then use that as your k value.
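One hypothetical way to sketch that "cluster all the way down, then look back" idea: record the similarity at which each merge happened, and then retrospectively pick the k that was in place just before the merge where similarity fell off most sharply (the point where we arguably started merging clusters that didn't belong together). This is my own illustrative helper, not a Cluto feature:

```python
# Hypothetical retrospective stopping rule: merge step i takes us from
# n_items - i clusters down to n_items - i - 1, and merge_sims[i] records
# the similarity of the pair merged at that step. The steepest drop
# between successive merge similarities marks the first "bad" merge.

def retrospective_k(merge_sims, n_items):
    """Return the number of clusters remaining just before the merge
    with the sharpest fall-off in similarity."""
    drops = [merge_sims[i - 1] - merge_sims[i]
             for i in range(1, len(merge_sims))]
    bad_step = drops.index(max(drops)) + 1   # index of the suspect merge
    return n_items - bad_step                # clusters left before it
```

For example, with 5 items and merge similarities [0.9, 0.85, 0.3, 0.2], the fall from 0.85 to 0.3 is the steepest, so in retrospect we should have stopped at k = 3.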
-- Ted Pedersen http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
