We have thus far focused our discussions on criterion functions that are typically associated with partitional clustering methods in cluto. However, we also need to consider the "link" family of criterion functions, which may only be used with agglomerative clustering methods. This will raise a number of interesting distinctions between the clustering methods and criterion functions, so let's start with a bit of review.
The criterion functions typically used with partitional methods include I1, I2, E1, H1, and H2. Of special note is I2, the "flower", since it is the default for cluto's partitional methods. (You can select any of the other criterion functions via the -crfun option.) By flower, I mean that I2 measures the similarity between each point in a cluster and the centroid of its cluster. It computes a value for each cluster in this way, and then sums those values together to create an overall measurement for a given clustering solution.

The partitional methods included with cluto are repeated bisections (rb), repeated bisections with global optimization (rbr), and direct (direct) clustering. A basic rule of thumb is that rb or rbr are good choices, unless you have a very small amount of data, in which case direct is good. By small, I mean maybe 20 contexts to cluster. The I2 criterion function is the default in cluto for all of these methods.

In general, a partitional method is given a desired number of clusters k and immediately goes about dividing the data into k clusters, then attempts to find the clustering that optimizes the given criterion function. The typical example of this is k-means, which starts by picking k random centroids in your data, reassigns items to clusters such that your criterion function is optimized, and then recomputes the k centroids. It keeps doing this until the centroids are stable from iteration to iteration.

Now, in cluto you can control a few things about the partitional methods that affect how hard they look for a solution. For example, the -ntrials option lets you select how many different solutions are computed by the partitional algorithm. In general, the solutions depend to some degree on the random choices made at the start, so the algorithms are re-run multiple times and then the best solution (according to the criterion function) is selected.
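The loop just described (random starting centroids, reassignment, recomputing centroids, and keeping the best of several trials) can be sketched roughly as follows. This is a simplified illustration, not cluto's actual implementation - in particular, the score here follows the point-to-centroid description of I2 given above, and the function names are my own:

```python
import math
import random

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def i2_score(clusters):
    # I2-style criterion (as described above): sum, over all clusters,
    # of each point's similarity to its own cluster centroid
    total = 0.0
    for pts in clusters:
        if not pts:
            continue
        c = centroid(pts)
        total += sum(cosine(p, c) for p in pts)
    return total

def kmeans_once(points, k, iters=20, rng=random):
    # one trial: random centroids, then reassign/recompute until stable
    cents = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda i: cosine(p, cents[i]))
            clusters[best].append(p)
        new_cents = [centroid(c) if c else cents[i]
                     for i, c in enumerate(clusters)]
        if new_cents == cents:   # centroids stabilized
            break
        cents = new_cents
    return clusters

def cluster(points, k, ntrials=10, seed=0):
    # like -ntrials: re-run with different random starts and keep
    # the solution with the best criterion score
    rng = random.Random(seed)
    return max((kmeans_once(points, k, rng=rng) for _ in range(ntrials)),
               key=i2_score)
```

With two obvious groups in the data (say, points near [1,0] and points near [0,1]) and k=2, the best-scoring trial separates the groups cleanly.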
By default -ntrials is 10, but if you have very large or otherwise unpredictable data, you might want to make this larger. You can also control the maximum number of refinement iterations performed at each step in the clustering via the -niter option.

In cluto, rb is the designated workhorse. It is the default method for vcluster, and if you aren't sure what method to use, it's a nice choice. It starts by bisecting the data into 2 clusters, and then one of these clusters is selected and bisected again. This process of dividing one of the existing clusters into two continues until you reach the desired number of clusters. Each time a bisection is done, it optimizes whatever criterion function you have chosen. (And if you haven't chosen, you get I2.) So, it's like a pairwise k-means, very loosely speaking. If you don't tell vcluster what kind of clustering algorithm you want (via the -clmethod option), you get rb.

Now, as best as I understand it, rbr proceeds just like rb, except that at the very end, once it has found the k clusters, it goes through a global optimization step to make sure the k clusters you have found optimize your criterion function. Well, I may as well explain direct now too. :) In fact, I'm not too sure how direct works - I think it is more or less like k-means - you give it some number of clusters k, and it will go ahead and find those k clusters all at once (not via the bisecting process of rb or rbr).

Now, there is one important distinction that must be made here - your criterion function and your similarity function are not the same thing! Your similarity function is how you measure the pairwise similarity between two contexts that you wish to cluster. When using vcluster, your choices for this are the cosine between the two contexts, or the correlation coefficient between them. These are controlled by the -sim option, where you get cosine by specifying -sim cos and correlation by specifying -sim corr. Note that cos is the default.
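To make the two -sim choices concrete, here is a generic sketch of the two similarity measures. This is textbook cosine and Pearson correlation, not cluto's exact code; one handy way to see the relationship is that correlation is just the cosine of the mean-centered vectors:

```python
import math

def cosine(u, v):
    # -sim cos: cosine of the angle between two context vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def correlation(u, v):
    # -sim corr: Pearson correlation, i.e. the cosine of the
    # mean-centered vectors
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])
```

The difference shows up with constant offsets: correlation([1, 2, 3], [4, 5, 6]) is 1.0 (a perfect linear relationship), while the cosine of that same pair is only about 0.97, since cosine is not shift-invariant.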
If you use graph clustering there are some other similarity functions, but we'll talk about those another time. (They include the well-known euclidean distance and jaccard coefficient.) So your criterion function will optimize an overall score for a cluster or clustering, and the individual pairwise scores that it gets for contexts will come from your similarity function. In general for SenseClusters, cosine is the right choice for vcluster, since we do a lot of similarity measurements prior to giving the data to cluto, so typically your vectors are ready to cluster "as is". That said, I have never really explored the other options very much, so they might work nicely - I just don't know.

Well, at last we are prepared to enter into a discussion of agglomerative clustering and the associated criterion functions. I think, however, that this note is long enough, so I will send it and then start another one. :)

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
