Now we have finally moved on to agglomerative clustering and its associated criterion functions, which are, generally speaking, the so-called "link" family of functions. (That's what I call them, at least. :) Remember, of course, that our objective in this discussion is to explore the possibilities of automatically stopping clustering using Cluto.
In partitional methods, we must run the clustering algorithm for each value of k, collect the resulting criterion function value at each k, and then look at the trend of those values as k increases. We have not yet decided how to identify the appropriate k at which to stop the clustering, but all of the information we need to make such a decision is likely contained in those criterion function values. We've now seen examples of that using a variety of the partitional criterion functions.

With agglomerative clustering things are potentially a little different. In general, agglomerative clustering works in a series of steps whereby it goes from having each item in its own cluster down to the case where all items are in one cluster. Hierarchical clustering can also proceed in the other direction, starting with everything in one cluster and then dividing it up. The former (many to 1) is agglomerative, and the latter (1 to many) is divisive. So, in fact, we can look at the value of the criterion function that is being optimized to make the merging decisions during agglomerative clustering, and use that as a means of finding a stopping point. Or we could evaluate the clusters found at each step using one of our partitional criterion functions to measure the quality of the clustering, and use that to determine the appropriate value of k. There are lots of interesting possibilities here.

From the brief description above it should be clear that agglomerative methods build clusters gradually, adding members one context at a time. This is unlike the partitional methods, which chop up the contexts into clusters very quickly. As such, the criterion functions for agglomerative clustering are designed to measure the similarity between pairs of clusters. So here is the basic idea behind the link family of criterion functions.
You have two clusters and you are trying to figure out how similar they are to each other. You measure all of the pairwise similarities between the items in one cluster and the items in the other, and then the similarity between the two clusters is assessed as follows:

- For single link (-crfun slink), the similarity between the two clusters is the maximum similarity (minimum distance) between any pair of items drawn from the two clusters. In other words, the similarity between two clusters is that of the two nearest neighbors across the clusters. Note that slink has a tendency to "chain", that is, to create long "narrow" clusters (since you are adding nearest neighbors at each step).

- For complete link (-crfun clink), the similarity between the two clusters is the minimum similarity (maximum distance) between any pair of items drawn from the two clusters. So the similarity between the two clusters is defined by the furthest neighbors. clink is known to create rounder, more globe-like clusters, since you are merging the clusters whose most distant neighbors are most similar.

- For Unweighted Pair Group Method with Arithmetic mean (-crfun upgma), the similarity between two clusters is the average of all the pairwise similarities between the items in the two clusters. This is comparable to average link clustering, which is in turn comparable to McQuitty's Similarity Analysis, which we have used in previous work.

Now, these methods cannot be used with partitional clustering, because they measure the similarity between a pair of clusters. At each step in the clustering, they must measure the pairwise similarities between all pairs of clusters, and then pick the two most similar clusters to merge (assuming that we are working agglomeratively). The partitional algorithms, by contrast, optimize the criterion function on all of the data (direct) or some portion of it (rb and rbr), and don't need to do the exhaustive pairwise computations that the link measures do.
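To make the three link criteria concrete, here is a minimal sketch in Python. These are illustrative re-implementations of the definitions above, not Cluto's actual code, and the item-level similarity function sim is an assumption (any function returning higher values for more similar items would do):

```python
# Sketch of the three "link" criteria, assuming sim(a, b) gives the
# similarity between two individual items (higher = more similar).

def slink(c1, c2, sim):
    """Single link: similarity of the two *nearest* cross-cluster items."""
    return max(sim(a, b) for a in c1 for b in c2)

def clink(c1, c2, sim):
    """Complete link: similarity of the two *furthest* cross-cluster items."""
    return min(sim(a, b) for a in c1 for b in c2)

def upgma(c1, c2, sim):
    """UPGMA: average of all cross-cluster pairwise similarities."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

Note how slink is dominated by the single closest pair, which is exactly why it chains, while clink is dominated by the single worst pair, which keeps clusters compact.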
Now, what is interesting is that the agglomerative algorithms can use the partitional criterion functions! The reverse is not true (as described in the previous paragraph): the partitional algorithms cannot use the link criterion functions. How can this be, you ask? Well, think about it for a moment. The agglomerative method is essentially checking pairs of clusters at each step, and finding the pair that is most similar in order to merge them. While you can make that judgement using complete link (clink), you could just as well apply H2 (for example) to each candidate pair of clusters and find out which pair has the best H2 score. While this is possible in theory, I am not sure whether it is a great idea in practice. The measures H1, H2, I1, I2, and E1 have certainly been designed with partitional methods in mind, and single link, complete link, and upgma have certainly been designed with agglomerative clustering in mind. So you should probably hold off on these more creative (yet perfectly reasonable) combinations until you have a great deal of experience with the different methods.

Next we'll take a look at some agglomerative output, and see what we can make of it. Of course, agglomerative methods are very tempting with respect to cluster stopping rules, because they have a natural iterative process where you are evaluating criterion functions at every step, and so the idea is to stop the clustering at just the right moment. Stopping on the fly is a bit overambitious, though, so in all likelihood the correct approach is to take the clustering from N down to 1, look back at the clustering decisions retrospectively to see at which point you should have stopped, and then use that as your k value.
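One hypothetical way to sketch that "cluster all the way down, then look back" idea: record the similarity at which each merge happened, and then retrospectively pick the k that was in place just before the merge where similarity fell off most sharply (the point where we arguably started merging clusters that didn't belong together). This is my own illustrative helper, not a Cluto feature:

```python
# Hypothetical retrospective stopping rule: merge step i takes us from
# n_items - i clusters down to n_items - i - 1, and merge_sims[i] records
# the similarity of the pair merged at that step. The steepest drop
# between successive merge similarities marks the first "bad" merge.

def retrospective_k(merge_sims, n_items):
    """Return the number of clusters remaining just before the merge
    with the sharpest fall-off in similarity."""
    drops = [merge_sims[i - 1] - merge_sims[i]
             for i in range(1, len(merge_sims))]
    bad_step = drops.index(max(drops)) + 1   # index of the suspect merge
    return n_items - bad_step                # clusters left before it
```

For example, with 5 items and merge similarities [0.9, 0.85, 0.3, 0.2], the fall from 0.85 to 0.3 is the steepest, so in retrospect we should have stopped at k = 3.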
-- Ted Pedersen http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
