Now it is time to start looking at some output from the agglomerative
clustering. Recall that in the partitional cases we simply ran the
clustering algorithm for each value of k in some range, and then looked at
the overall criterion value computed for each value of k.
With agglomerative clustering it's not always that simple, since the
link criterion functions do not look at the overall "fit" of the data to
the given clusters. By that I mean that when you use the rb clustering
method with the I2 criterion function, at the end of each step a value for
I2 is computed over all of the data, which can be used as a general
measure of clustering quality. The criterion functions I1, I2, E1, H1, and
H2 have, in some loose sense, a "global" view of the data.
Now, for the link criterion functions, the view is much more localized.
Remember that these criterion functions look at pairs of clusters, and the
pair of clusters that has the best link value is chosen to be merged. As
such, the link criterion functions don't provide a more global measurement
(as we can do with the partitional methods).
But, remember, we can in fact use the partitional criterion functions
with agglomerative clustering. :) In that case, each pairwise merge
decision is made based on the value of the criterion function (I2, let's
say) for that local decision, and then, at the end of your clustering, you
can also compute the overall I2 value for the quality of your clustering
solution. So I2 is used in two different ways: for the pairwise decisions
about which clusters get merged, and then overall for the global
assessment of clustering quality.
Let's see what that means in terms of some cluto output. Below we show the
standard output generated by cluto using the -showtree and -labeltree options.
We are using -clmethod agglo and -crfun i2 with vcluster and have
specified a value of k=3.
Now, we can see that we get the overall I2 value for the 3-cluster
solution. That could be compared to the 1- and 2-cluster values and so
forth (as we have been describing previously). To be clear, the I2 value
shown below is the sum of the I2 values for each of the three clusters.
------------------------------------------------------------------------
3-way clustering: [I2=1.37e+02] [321 of 321]
------------------------------------------------------------------------
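To make that sum concrete, here is a rough Python sketch (my own illustration, not cluto code) of how I2 can be computed, assuming cosine similarity: each cluster's contribution is the square root of the total pairwise similarity among its members, which for unit-length vectors equals the norm of the cluster's composite (summed) vector.

```python
import numpy as np

def i2_per_cluster(vectors, labels):
    """I2 contribution of each cluster: sqrt of the total pairwise cosine
    similarity among members, i.e. the norm of the composite vector once
    the rows are normalized to unit length."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    values = {}
    for c in np.unique(labels):
        composite = v[labels == c].sum(axis=0)
        values[int(c)] = float(np.linalg.norm(composite))
    return values

# toy data: two tight 2-d clusters
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
per_cluster = i2_per_cluster(X, y)
overall_i2 = sum(per_cluster.values())  # the kind of sum cluto reports
```

The overall value printed on the 3-way clustering line is that kind of sum, taken over the three clusters.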
We also get a nice summary of our discovered clusters. Clusters are
numbered from 0 to k-1 in cluto, ordered by the descending value of
ISim-ESim (more about that shortly).
------------------------------------------------------------------------
cid Size ISim ISdev ESim ESdev |
------------------------------------------------------------------------
0 80 +0.683 +0.082 +0.043 +0.016 |
1 80 +0.101 +0.033 +0.040 +0.013 |
2 161 +0.080 +0.022 +0.046 +0.019 |
------------------------------------------------------------------------
The ISim value is the internal similarity among the items in a cluster,
and ESim is the external similarity of the cluster to other clusters.
Thus, a high value of ISim-ESim is associated with a tight, well-separated
cluster, which is really what we want. ISdev and ESdev are the standard
deviations of these similarities (which are averages over all the
members of the cluster). Note that if you use the -zscores option you can
see (in the clustering file output) the internal and external similarity
for each member of each cluster (which is where these averages come from).
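As a sketch of how I read those columns (my own Python with hypothetical function names, assuming cosine similarity): ISim is the average similarity between members of a cluster, and ESim is the average similarity between members and non-members.

```python
import numpy as np

def isim_esim(vectors, labels, cluster):
    """Average internal and external cosine similarity for one cluster
    (a sketch of what the ISim/ESim columns appear to report)."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = v @ v.T                       # full cosine similarity matrix
    inside = labels == cluster
    s_in = sim[np.ix_(inside, inside)]
    n = s_in.shape[0]
    # exclude the n self-similarities (each 1.0) from the internal average
    isim = (s_in.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
    esim = sim[np.ix_(inside, ~inside)].mean()
    return isim, esim

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
isim0, esim0 = isim_esim(X, y, 0)  # tight cluster: isim well above esim
```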
Now, we can also see the tree structure created by the clustering (this
is what the -showtree and -labeltree options provide). For example, below
we see that cluster 4 was split into clusters 0 and 3 (i.e., cluster 4 is
the parent of clusters 0 and 3), and that cluster 3 was split into
clusters 1 and 2 (i.e., cluster 3 is the parent of 1 and 2). Below you
see the ISim and XSim values not only for the children (also shown above)
but also for the parents. Then you also get a new piece of information:
the Gain.
------------------------------------------------------------------------------
Hierarchical Tree that optimizes the I2 criterion function...
------------------------------------------------------------------------------
Size ISim XSim Gain
4 [ 321, 9.58e-02, 0.00e+00, -2.88e+01]
|-----0 [ 80, 6.83e-01, 4.27e-02, +0.00e+00]
|-3 [ 241, 6.63e-02, 4.27e-02, -8.98e+00]
|---2 [ 161, 8.04e-02, 4.35e-02, +0.00e+00]
|---1 [ 80, 1.01e-01, 4.35e-02, +0.00e+00]
------------------------------------------------------------------------------
In the case of I1, I2, H1, H2, and E1, the Gain is the amount by which
the criterion function changed upon combining two clusters. For example,
you can think of cluster 4 as being created by combining clusters 0 and 3.
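For a partitional criterion like I2, that gain can be sketched directly (again my own Python, using the composite-vector identity for cosine similarity): it is the merged cluster's I2 contribution minus the sum of the two children's contributions, and merging two dissimilar clusters makes it negative, matching the negative gains above.

```python
import numpy as np

def i2_value(unit_vectors):
    # I2 contribution of a cluster of unit vectors = norm of their sum
    return float(np.linalg.norm(unit_vectors.sum(axis=0)))

def merge_gain(unit_a, unit_b):
    """Change in I2 when clusters a and b are merged into one cluster."""
    merged = np.vstack([unit_a, unit_b])
    return i2_value(merged) - (i2_value(unit_a) + i2_value(unit_b))

# two orthogonal clusters: merging them lowers I2, so the gain is negative
a = np.array([[1.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 1.0], [0.0, 1.0]])
gain = merge_gain(a, b)  # 2*sqrt(2) - 4, about -1.17
```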
Finally, you can see the hierarchical clusters in a machine-readable
format in your cltree file. If you specify -fulltree, you will get a tree
file that shows the hierarchy for the entire clustering, and that might
turn out to be pretty useful. But, before we get to that, let's look at
the simpler cltree format. Here is the raw form of the nicely formatted
information shown above:
4 0.000000e+00 0.000000e+00
3 0.000000e+00 0.000000e+00
3 0.000000e+00 0.000000e+00
4 4.354297e-02 -8.981900e+00
-1 4.274959e-02 -2.881339e+01
Yikes!! :)
What on earth does this tell us? Well, first you need to remember that
cluto numbers clusters from 0 to k-1. Line n of the cltree file tells
you the parent of cluster n-1.
So, cluster 4 is the parent of the 0th cluster. Cluster 3 is the parent of
cluster 1. Cluster 3 is also the parent of cluster 2. Cluster 4 is the
parent of cluster 3. And then the root node is the parent of cluster 4.
That is exactly what the diagram above says.
Now, what are those values that come after the parent number? Well, the
first value is the internal similarity of the siblings in the tree we are
creating (and these siblings will of course be members of the same
cluster). It is zero on the first three lines because those are leaf
nodes. The second value is the Gain, which tells us how much the
criterion function changes when the two clusters are combined!
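Here is a small Python sketch (a hypothetical helper, using the exact cltree lines above) that decodes the file according to that rule:

```python
def parse_cltree(lines):
    """Line n (1-indexed) of a cltree file gives the parent of cluster
    n-1, followed by the similarity and Gain values discussed above."""
    parent, sims, gains = {}, {}, {}
    for child, line in enumerate(lines):
        fields = line.split()
        parent[child] = int(fields[0])   # -1 means the parent is the root
        sims[child] = float(fields[1])
        gains[child] = float(fields[2])
    return parent, sims, gains

raw = """\
4 0.000000e+00 0.000000e+00
3 0.000000e+00 0.000000e+00
3 0.000000e+00 0.000000e+00
4 4.354297e-02 -8.981900e+00
-1 4.274959e-02 -2.881339e+01"""
parent, sims, gains = parse_cltree(raw.splitlines())
# parent is {0: 4, 1: 3, 2: 3, 3: 4, 4: -1}, matching the tree diagram
```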
Now, the Gain is sort of interesting because it measures how much the
criterion function changes when two clusters are combined, and that is,
in the end, the key piece of information for stopping an agglomerative
clustering process. If we are using a partitional criterion function, we
could in fact simply look at its overall value for a series of k values,
and that might not be a bad approach. But if you are using a link
criterion, the Gain is the only information you really have, so it's
clearly something we'll need to pay attention to. Note that in the case
of the link criterion functions, the Gain is not the change in a
criterion function but rather the value of the link that was judged best
for merging.
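For reference, here are the usual textbook definitions of those three link values, sketched in Python over a precomputed similarity matrix (this is the standard formulation, not necessarily cluto's exact internals):

```python
import numpy as np

def link_values(sim, a_members, b_members):
    """Single-link, complete-link, and UPGMA link between two clusters,
    computed from the block of pairwise similarities between them."""
    cross = sim[np.ix_(a_members, b_members)]
    return {
        "slink": float(cross.max()),   # most similar single pair
        "clink": float(cross.min()),   # least similar single pair
        "upgma": float(cross.mean()),  # average over all cross pairs
    }

# toy similarity matrix for items 0..3, with clusters {0,1} and {2,3}
sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
links = link_values(sim, [0, 1], [2, 3])
```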
So, by way of summary, we can see that there are two main sources of
information to be used in cluster stopping.
The first are the criterion functions I1, I2, H1, H2, and E1 that can be
used with the partitional AND the agglomerative methods. These values can
easily be obtained by running the clustering algorithm k times, and
extracting the overall value from the STDOUT that comes from cluto.
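Extracting that value can be as simple as a regular expression over the STDOUT; here is a hypothetical Python helper (the pattern assumes the bracketed format shown above, e.g. [I2=1.37e+02], and may need adjusting for other cluto versions):

```python
import re

def overall_criterion(stdout_text):
    """Pull the overall criterion value from a line such as
    '3-way clustering: [I2=1.37e+02] [321 of 321]'."""
    m = re.search(r"\[\w+=([0-9.eE+-]+)\]", stdout_text)
    return float(m.group(1)) if m else None

value = overall_criterion("3-way clustering: [I2=1.37e+02] [321 of 321]")
# value is 137.0
```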
The second are the gain values that are provided for pairwise
combinations of clusters. This information is available for the
partitional criterion functions (in addition to the above) and is the
only information available when using the link criterion functions
(slink, clink, upgma). This information is most conveniently available
in the tree files that are created via the -fulltree option in cluto.
Enjoy,
Ted
BTW, below I show the complete STDOUT from cluto for this run. This does
not include the various files created, just STDOUT. I chopped things up a
bit for the discussion above.
*******************************************************************************
vcluster (CLUTO 2.1) Copyright 2001-02, Regents of the University of Minnesota
Matrix Information -----------------------------------------------------------
Name: zori3.vectors, #Rows: 321, #Columns: 5587, #NonZeros: 536355
Options ----------------------------------------------------------------------
CLMethod=AGGLO, CRfun=I2, SimFun=Cosine, #Clusters: 3
RowModel=None, ColModel=IDF, GrModel=SY-DIR, NNbrs=40
Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5
CSType=Best, AggloFrom=0, AggloCRFun=I2, NTrials=10, NIter=10
Solution ---------------------------------------------------------------------
------------------------------------------------------------------------
3-way clustering: [I2=1.37e+02] [321 of 321]
------------------------------------------------------------------------
cid Size ISim ISdev ESim ESdev |
------------------------------------------------------------------------
0 80 +0.683 +0.082 +0.043 +0.016 |
1 80 +0.101 +0.033 +0.040 +0.013 |
2 161 +0.080 +0.022 +0.046 +0.019 |
------------------------------------------------------------------------
------------------------------------------------------------------------------
Hierarchical Tree that optimizes the I2 criterion function...
------------------------------------------------------------------------------
Size ISim XSim Gain
4 [ 321, 9.58e-02, 0.00e+00, -2.88e+01] []
|-----0 [ 80, 6.83e-01, 4.27e-02, +0.00e+00] []
|-3 [ 241, 6.63e-02, 4.27e-02, -8.98e+00] []
|---2 [ 161, 8.04e-02, 4.35e-02, +0.00e+00] []
|---1 [ 80, 1.01e-01, 4.35e-02, +0.00e+00] []
------------------------------------------------------------------------------
Timing Information -----------------------------------------------------------
I/O: 2.010 sec
Clustering: 1.400 sec
Reporting: 1.090 sec
*******************************************************************************
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users