As you might recall, we have described a reasonably large number of
criterion functions that cluto uses during clustering to measure
inter- and intra-cluster similarity/difference. These values guide
cluto toward the optimal arrangement of contexts into clusters (so as
to maximize or minimize the desired criterion function, as the case may
be). The criterion functions we have discussed thus far are I1, I2, E1,
E2, H1, and H2, and they may be used with agglomerative or partitional
clustering. (The family of "link" criterion functions - complete link,
single link, etc. - can also be used with agglomerative algorithms.
More about those in due course.)
So, as you know, Cluto/SenseClusters requires you to specify the number
of clusters you want a priori. The point of this discussion is to
propose a few ways to use Cluto's very fast and flexible computation of
criterion functions to formulate stopping rules. Note that in previous
messages we have illustrated the basic idea - when looking at a
criterion function computed for various values of k, there may be a
knee or elbow in the graph that gives us a good clue as to what the
best k should be.
Now, one specific proposal for a stopping rule was made by Hartigan
(1975). This is described in a previous note to this list, but to
review, the stopping rule is formulated like this...
W(k)
H(k) = (--------- - 1) * (Q)
W(k+1)
where Q = N - k - 1
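The formula above is easy to transcribe directly. Here is a minimal
Python sketch (a hypothetical helper, not part of Cluto or
SenseClusters) that computes H(k) from two successive criterion values:

```python
def hartigan(w_k, w_k1, n, k):
    """Hartigan's stopping rule: H(k) = (W(k)/W(k+1) - 1) * (N - k - 1).

    w_k  -- criterion value W(k) for k clusters
    w_k1 -- criterion value W(k+1) for k+1 clusters
    n    -- number of contexts being clustered
    k    -- current number of clusters
    """
    return (w_k / w_k1 - 1) * (n - k - 1)
```

With the Mexico-Brazil numbers used later in this note (N=321,
W(1)=99.3, W(2)=129.0), hartigan(99.3, 129.0, 321, 1) comes out to
about -73.4, matching the example below.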
W( ) is said to be a measure of intra-cluster similarity, so we might
wish to use I1 or I2 here. However, I see no particular reason why
other criterion functions could not be employed - for example, you
could use E1 if you wanted to focus on inter-cluster similarity.
The Q term is a bit of a mystery to me, although it makes me think of
degrees of freedom (number of contexts - number of clusters - 1). So we
take the ratio of the similarity/difference scores and scale it by the
degrees of freedom in the clustering problem? (That last part is
speculative and needs a closer look, I think.)
But the basic idea is to compare the ratio of the similarity scores
associated with successive steps in the clustering algorithm, and to
stop clustering when the value of H(k) falls below or rises above some
cutoff. We start with k=1 cluster and move to larger numbers of
clusters in a stepwise fashion. Note that this is what divisive
algorithms do naturally, and what we can force partitional algorithms
to do by running them for each value of k we are interested in.
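That stepwise loop can be sketched in a few lines of Python. This
assumes the W(k) values have already been collected (say, by running a
partitional clustering once per k and reading the criterion value from
Cluto's output); the function name and the cutoff are my own
illustrative choices, not anything SenseClusters provides:

```python
def first_k_past_cutoff(w_values, n, cutoff):
    """Return the first k whose Hartigan score H(k) exceeds the cutoff.

    w_values -- list where w_values[i] holds W(i+1), i.e. the criterion
                value for i+1 clusters
    n        -- number of contexts being clustered
    cutoff   -- stopping threshold; with a similarity criterion H(k) is
                negative, so 'exceeds' means closer to zero
    Returns None if no k qualifies.
    """
    for k in range(1, len(w_values)):
        h = (w_values[k - 1] / w_values[k] - 1) * (n - k - 1)
        if h > cutoff:
            return k
    return None
```

For instance, feeding in the W values from the example below with a
(hypothetical) cutoff of -10 selects k=4, the first step where H(k)
rises above the threshold.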
If W( ) is a measure of dissimilarity, then W(k) will always be greater
than or equal to W(k+1), while if W( ) is a measure of similarity then
W(k) will always be less than or equal to W(k+1). The intuition here is
that as you create more clusters, the members of each cluster become
more like each other (and therefore less dissimilar), reaching maximal
similarity when there is one context per cluster.
Now, when using a measure of similarity, you must set a threshold for
the score to exceed; when that occurs you have found your k value. Let
us consider what is happening - if the intra-cluster similarity
associated with k and k+1 clusters is very close, then there is no
particular advantage to forming either number of clusters, and the
ratio W(k)/W(k+1) will be very close to 1. After subtracting 1 you get
a negative value close to 0 (e.g. -0.1), which keeps the overall score
small in magnitude. This is probably a signal that you should have
already stopped clustering!
If the step from k to k+1 clusters produces a bigger difference in
similarity scores, you might be at one of these knee points, which
shows up as a sudden change in the Hartigan score. In this case, moving
from k to k+1 clusters really does improve the criterion function, so
the ratio W(k)/W(k+1) is driven well below 1 (since W(k) < W(k+1) for
similarity scores) and H(k) becomes strongly negative.
Let's look at a specific example, taken from our Mexico-Brazil data
that was used in the previous examples of the criterion scores. In this
case N is 321, and we are using the I2 criterion function (which tries
to maximize the similarity of the contexts to their cluster centroids -
I have called this the flower from time to time).
I really don't know what a good threshold would be. I suspect that the
trend in the Hartigan values is really what matters. Let's see...
Here we go, starting at k=1.
k=1 Q=319 W(1)=99.3 H(1)=-73.4
zori3.i2.output:1-way clustering: [I2=9.93e+01] [321 of 321]
-73 is a strongly negative value, so we probably wouldn't stop here...
k=2 Q=318 W(2)=129.0 H(2)=-22.9
zori3.i2.output:2-way clustering: [I2=1.29e+02] [321 of 321]
Ah, a big jump to -22, which is a much larger (closer to zero) value...
k=3 Q=317 W(3)=139.0 H(3)=-13.1
zori3.i2.output:3-way clustering: [I2=1.39e+02] [321 of 321]
Another jump, not quite as great as from k=1 to 2, but still observable.
k=4 Q=316 W(4)=145.0 H(4)=-8.4
zori3.i2.output:4-way clustering: [I2=1.45e+02] [321 of 321]
Now at k=4 we seem to have settled down to a situation where the W(k)
and W(k+1) values are pretty close to each other, so we should have
stopped clustering by now. The question really becomes whether the
value at k=2, k=3, or perhaps k=4 was large enough to stop clustering.
k=5 Q=315 W(5)=149.0 H(5)=-8.2
zori3.i2.output:5-way clustering: [I2=1.49e+02] [321 of 321]
k=6 Q=314 W(6)=153.0 H(6)=-6.0
zori3.i2.output:6-way clustering: [I2=1.53e+02] [321 of 321]
k=7 Q=313 W(7)=156.0 H(7)=-7.8
zori3.i2.output:7-way clustering: [I2=1.56e+02] [321 of 321]
k=8 Q=312 W(8)=160.0 H(8)=-3.9
zori3.i2.output:8-way clustering: [I2=1.60e+02] [321 of 321]
k=9 Q=311 W(9)=162.0 H(9)=-5.65
zori3.i2.output:9-way clustering: [I2=1.62e+02] [321 of 321]
...
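As a sanity check, the H(k) values above can be recomputed from the
listed I2 scores with a short Python loop (small rounding differences
against the values I typed in by hand are to be expected):

```python
# I2 criterion values W(1)..W(9) from the Mexico-Brazil example, N = 321.
N = 321
W = [99.3, 129.0, 139.0, 145.0, 149.0, 153.0, 156.0, 160.0, 162.0]

# H(k) = (W(k)/W(k+1) - 1) * (N - k - 1), for k = 1..8.
for k in range(1, len(W)):
    h = (W[k - 1] / W[k] - 1) * (N - k - 1)
    print(f"H({k}) = {h:.1f}")
```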
So, that's an example of how a Hartigan-like stopping rule could be
formulated. Seems rather neat. :)
Now, is this what SenseClusters should do? I am not sure yet. We are
just thinking out loud right now, trying out different ideas. There are
other methods to consider.
Enjoy,
Ted
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users