The external criterion functions (E1 and E2) provided by Cluto (and used
by SenseClusters) focus on inter-cluster similarity; that is, they try to
maximize the degree to which the discovered clusters are separated, without
regard to their intra-cluster similarity. In fact, these external measures
rely on the centroids of the clusters, which are themselves based on
intra-cluster similarity, so strictly speaking they are not pure external
criterion functions. But you will see they are quite different from I1
and I2.

E1 locates the centroid of the entire collection of contexts, and then
tries to find the clustering solution that results in clusters whose
centroids differ the most from the overall centroid of the entire
collection. It is trying to minimize the cosine (that is, increase the
angle) between the overall centroid and the centroid of each cluster,
taken one cluster at a time. The calculation of the angle is scaled
by the size of the cluster, so there is no advantage given to larger
clusters. As a result, when used by itself, E1 tends to result in finding
clusters of comparable size.
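
To make the calculation above concrete, here is a minimal Python sketch of
an E1-style score. It is only an illustration and not Cluto's code: the
names centroid, cosine and e1_score are made up for this example, and it
assumes that "scaled by the size of the cluster" means each cosine is
multiplied by the number of contexts in that cluster. Lower scores are
preferred.

```python
import math

def centroid(vectors):
    # Component-wise mean of a list of equal-length context vectors.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    # Cosine of the angle between u and v (assumes neither is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def e1_score(clusters):
    # Hypothetical E1-style score: for each cluster, the cosine between
    # its centroid and the overall centroid, weighted by cluster size
    # (an assumption about the scaling). Lower is better.
    all_contexts = [vec for cluster in clusters for vec in cluster]
    overall = centroid(all_contexts)
    return sum(len(cluster) * cosine(centroid(cluster), overall)
               for cluster in clusters)

# Four toy contexts in two dimensions, grouped in two different ways.
separated = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]

print(e1_score(separated))  # roughly 2.97
print(e1_score(mixed))      # roughly 4.0
```

In the mixed solution both cluster centroids sit right on top of the
overall centroid, so each cosine is 1 and the score is as high as it can
be. The separated solution pulls the cluster centroids away from the
overall centroid and gets the lower (preferred) score.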

The idea is to find clusters whose centroids are as far as possible from
the centroid of the overall collection of data. Now, could this
result in finding two clusters that are close to each other but both
very far from the overall centroid? Yes, I think in fact this is possible
if E1 is used on its own, but we'll see that this is usually not
recommended.

So, E1 only cares about making sure that each cluster's centroid is far
from the overall centroid, but it does not try to make sure that the
cluster centroids are far apart from each other. In effect, E1 is
working in a pairwise fashion between cluster centroids and the overall
centroid. This is similar to I2, since I2 works between individual
contexts and the centroid of their cluster (but not between the
individual contexts). So E1 is not doing an exhaustive pairwise
comparison between all centroids but rather is focusing on the centroid
of each cluster relative to the centroid of the collection. As such this
is a more global or external consideration than the more localized
concerns of I1 and I2 (which are confined to within-cluster distances).
E1 simply takes the sum of the cosine measurements between each cluster
centroid and the overall centroid, and prefers the solution that results
in the lowest score (which will have the greatest angles between the
cluster centroids and the overall centroid).

Now, E2 is very similar: it simply seeks to maximize the squared
error between the cluster centroids and the overall collection centroid.

So, you can see that both E1 and E2 focus on centroids, and not individual
contexts within a cluster. This is why they are called external criterion
functions.

Now, in general the use of E1 or E2 on their own is not recommended.
Rather, they are useful in hybrid methods that seek to balance
intra-cluster similarity (I1 and I2) with inter-cluster similarity (E1 in
particular). These hybrid measures are H1 and H2, and will be the subject
of my next message.

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

