We have thus far focused our discussions on criterion functions that are
typically associated with partitional clustering methods in cluto.
However, we also need to consider the "link" family of criterion
functions which may only be used with agglomerative clustering methods.
This will raise a number of interesting distinctions between the
clustering methods and criterion functions, so let's start with a bit of
review.

The criterion functions typically used with partitional methods include
I1, I2, E1, H1, and H2. Of special note is I2, the "flower", since it
is the default for cluto's partitional methods. (You can select any of the
other criterion functions via the -crfun option.) By flower, I mean that
I2 measures the similarity between each point in a cluster and the
centroid of that cluster. It computes a similarity value for each
cluster in this way, and then sums them together to create an overall
measurement for a given clustering solution.
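To make that concrete, here is a tiny Python sketch of the idea. The function name and the exact formula are my own illustration (cluto's internal definition of I2 is stated a bit differently, in terms of within-cluster similarity), so treat this as the intuition rather than the implementation:

```python
import numpy as np

def i2_style_score(points, labels):
    """Score a clustering by summing, over all clusters, the cosine
    similarity of each member to its cluster centroid (the I2 idea).
    A hypothetical sketch, not cluto's actual formula."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    score = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        # cosine similarity of each member to the centroid
        norms = np.linalg.norm(members, axis=1) * np.linalg.norm(centroid)
        score += float(np.sum(members @ centroid / norms))
    return score

# Two tight clusters score higher than a scrambled assignment.
pts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = i2_style_score(pts, [0, 0, 1, 1])
bad = i2_style_score(pts, [0, 1, 0, 1])
```

The point is just that a clustering whose members sit close to their own centroids gets a higher overall score, which is what the criterion function is maximizing.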

The partitional methods included with cluto are repeated bisections (rb),
repeated bisections with global optimization (rbr), and direct (direct)
clustering. A basic rule of thumb is that rb or rbr are good choices,
unless you have a very small amount of data, in which case direct is
good. By small, I mean maybe 20 contexts to cluster. The I2 criterion
function is the default in cluto for all these methods.

In general, a partitional method will be given a desired number of
clusters k and immediately go about dividing the data into k clusters,
and then attempt to find the clustering that optimizes the given
criterion function. The typical example of this is k-means, which starts
by picking k random centroids in your data, then reassigns items to
clusters such that your criterion function is optimized, and then
recomputes the k centroids. It keeps doing this until the centroids
stabilize from iteration to iteration.
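In code, that loop looks something like the following. This is plain k-means with Euclidean distance, written from scratch for illustration, not anything taken from cluto:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: pick k random points as starting centroids,
    assign each point to its nearest centroid, recompute centroids,
    and repeat until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each point to its nearest centroid (Euclidean here)
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c)
            else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilized
        centroids = new_centroids
    return labels, centroids

# two obvious blobs should end up in two different clusters
pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
labels, centroids = kmeans(pts, k=2)
```

Because the starting centroids are random, different runs can give different answers, which is exactly why re-running and keeping the best solution (as -ntrials does below) is useful.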

Now, in cluto you can control a few things about the partitional methods
that affect how hard it looks for a solution. For example, the -ntrials
option lets you select how many different solutions are computed by the
partitional algorithm. In general, the solutions depend to some degree on
the random choices made at the start, so the algorithms are re-run
multiple times and then the best solution (according to the criterion
function) is selected. By default -ntrials is 10, but if you have very
large or otherwise unpredictable data, you might want to make this larger.
You can also control the maximum number of refinements to be performed at
each step in the clustering, via the use of the -niter option.

In cluto, rb is the designated workhorse. It is the default method for
vcluster, and if you aren't sure what method to use, it's a nice choice.
It starts by bisecting the data into 2 clusters, and then one of these
groups is selected and bisected again. This process of dividing one of the
existing clusters into two continues until you reach the desired number
of clusters. Each time a bisection is done, it optimizes whatever
criterion function you have chosen. (And if you haven't chosen one, you
get I2.) So, very loosely speaking, it's like a pairwise k-means. If you
don't tell vcluster what kind of clustering algorithm you want (via the
-clmethod option) you get rb. Now, as best as I understand it, rbr
proceeds just like rb, except that at the very end, once it has found
the k clusters, it goes through a global optimization step to make sure
the k clusters you have found optimize your criterion function.
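Loosely, the control flow of rb looks like the sketch below. Be warned that the stand-ins are my own: real rb chooses which cluster to bisect, and how to split it, so as to optimize your criterion function, whereas here I pick the largest cluster and split it at a median just to show the shape of the loop:

```python
def repeated_bisection(points, k, bisect):
    """Sketch of rb's control flow: start with everything in one
    cluster, then repeatedly split one existing cluster in two until
    there are k clusters. Assumes k <= len(points)."""
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        # pick the largest cluster to split (a toy heuristic; cluto's
        # actual selection is driven by the criterion function)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        left, right = bisect(points, target)
        clusters.extend([left, right])
    return clusters

def toy_bisect(points, idxs):
    # split one cluster at the median of its first coordinate; real rb
    # would instead search for the split that optimizes the criterion
    ordered = sorted(idxs, key=lambda i: points[i][0])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

pts = [[0.0], [1.0], [10.0], [11.0]]
clusters = repeated_bisection(pts, 2, toy_bisect)
```

Even with these toy pieces you can see the essential difference from direct clustering: the k-way solution is built up two clusters at a time.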

Well, I may as well explain direct now too. :) In fact, I'm not too sure
how direct works - I think it is more or less like k-means - you give it
some number of clusters k, and it will go ahead and find those k clusters
all at once (not via the bisecting process of rb or rbr).

Now, there is one important distinction that must be made here - your
criterion function and your similarity function are not the same thing!
Your similarity function is how you measure the pairwise similarity
between two contexts that you wish to cluster. When using vcluster, your
choices for this are the cosine between the two contexts, or the
correlation coefficient between them. These are controlled by the -sim
option, where you get cosine by specifying -sim cos and correlation by -sim
corr. Note that cos is the default. If you use graph clustering there are
some other similarity functions, but we'll talk about those another time.
(They include the well known euclidean distance and jaccard coefficient).
So your criterion function will optimize an overall score for a cluster
or clustering, and the individual pairwise scores that it gets for
contexts will come from your similarity function. In general, for
SenseClusters cosine is the right choice for vcluster, since we do a lot
of similarity measurements prior to giving the data to cluto; typically
your vectors are ready to cluster "as is". That said, I have never really
explored the other options very much, so they might work nicely, I just
don't know.
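If the distinction between the two -sim options isn't obvious, these are the standard textbook definitions of cosine and Pearson correlation (cluto's exact normalization may differ, so consider this illustrative). Correlation is just cosine after centering each vector on its own mean, so it ignores a constant offset that cosine does not:

```python
import math

def cosine(u, v):
    """-sim cos style: cosine of the angle between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def correlation(u, v):
    """-sim corr style: Pearson correlation, i.e. cosine after
    centering each vector on its own mean."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

# v is u shifted by a constant 10: perfectly correlated, but the
# cosine between the raw vectors is less than 1
u, v = [1.0, 2.0, 3.0], [11.0, 12.0, 13.0]
cos_sim = cosine(u, v)
corr_sim = correlation(u, v)
```

Either way, these pairwise scores are what the criterion function then aggregates into the overall score it optimizes.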

Well, at last we are prepared to enter into a discussion of
agglomerative clustering and the associated criterion functions. I think
however, that the above note is long enough, so I will send it and then
start another one. :)

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
