The external criterion functions (E1 and E2) provided by Cluto (and used by SenseClusters) focus on inter-cluster similarity; that is, they try to maximize the degree to which the discovered clusters are separated, without regard to their intra-cluster similarity. In fact, these external measures rely on the centroids of the clusters, and a centroid is computed from the contexts within its cluster, so strictly speaking they are not pure external criterion functions. But you will see they are quite different from I1 and I2.
E1 locates the centroid of the entire collection of contexts, and then tries to find the clustering solution whose cluster centroids differ most from that overall centroid. It does this by minimizing the cosine (that is, increasing the angle) between the overall centroid and the centroid of each cluster, one pair at a time. The calculation is scaled by the size of the cluster, so there is no advantage given to larger clusters. As a result, when used by itself, E1 tends to find clusters of comparable size.
Now, could this result in finding two clusters that are close to each other but both very far from the overall centroid? Yes, I think in fact this is possible if E1 is used on its own, but we'll see that this is usually not recommended. E1 only cares that each cluster's centroid is far from the overall centroid; it does not try to make sure that the cluster centroids are far from each other.
In effect, E1 works in a pairwise fashion between each cluster centroid and the overall centroid. This is similar to I2, which works between individual contexts and the centroid of their cluster (but not between the individual contexts themselves). So E1 is not doing an exhaustive pairwise comparison between all centroids. As such it is a more global or external consideration than the more localized concerns of I1 and I2 (which are confined to within-cluster distances). E1 simply takes the sum of the cosine measurements between each cluster centroid and the overall centroid, and prefers the solution with the lowest score (which will have the greatest angles between the cluster centroids and the overall centroid).
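
To make the E1 description concrete, here is a minimal sketch in plain Python. It is not Cluto's code. It assumes that "scaled by the size of the cluster" means each cosine is multiplied by the number of contexts in that cluster, and the names `centroid`, `cosine`, and `e1_score` are made up for this illustration. Contexts are plain lists of numbers.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def e1_score(clusters):
    """Size-weighted sum of cosines between each cluster centroid and the
    overall centroid. Assumed reading of E1: lower scores are preferred."""
    all_contexts = [vec for cluster in clusters for vec in cluster]
    overall = centroid(all_contexts)
    return sum(len(cluster) * cosine(centroid(cluster), overall)
               for cluster in clusters)

# Four toy contexts: two point mostly along x, two mostly along y.
a, b = [1.0, 0.0], [0.9, 0.1]
c, d = [0.0, 1.0], [0.1, 0.9]

separated = [[a, b], [c, d]]  # clusters follow the two directions
mixed = [[a, c], [b, d]]      # each cluster mixes both directions

print("separated:", e1_score(separated))
print("mixed:    ", e1_score(mixed))
```

In the mixed solution both cluster centroids sit right on top of the overall centroid, so every cosine is about 1 and the score is about as high as it can get. The separated solution pulls the centroids away from the overall centroid and scores lower, which is the solution E1 prefers. Note that nothing in `e1_score` compares the cluster centroids with each other.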
Now, E2 is very similar, as it simply wishes to maximize the squared error between the cluster centroids and the overall collection centroid. So, you can see that both E1 and E2 focus on centroids, and not on individual contexts within a cluster. This is why they are called external criterion functions.
Now, in general the use of E1 or E2 on their own is not recommended. Rather, they are useful in hybrid methods that seek to balance intra-cluster similarity (I1 and I2) with inter-cluster similarity (E1 in particular). These measures are H1 and H2, and will be the subject of my next message.
Ted
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
