Thanks Jeff. My experiment ran k-means three times, with k = 10, 20,
and 30, on roughly 3000 documents (guessing here).
The k=10 run did not prune; the k=30 run pruned 4 clusters. I'll run
this again to see if it is repeatable, and you are welcome to the dataset.
I read that comment but was confused about the representative points.
They appear to be collected by the RepresentativePointsDriver. The only
input that looks relevant is an iteration number. I'll try increasing
that to see if the points are better chosen, I guess? Basically, pruned
clusters are left out of the analysis, so I should do something to
remedy the pruning.
I'd really like to get this working so if you have any suggestions for
what to look at I'll give it a try. I have a tiny data set (16 small
docs) I could use where you could probably calculate the CDbw by hand. k
= 1, 2 maybe.
I'll poke around and see what I can find.
On 5/17/12 2:33 PM, Jeff Eastman wrote:
Hi Pat,
I don't have a good answer here. Evidently, something in CDbw has
become broken and you are the first to notice. When I run
TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly
incorrect. The values for Canopy, MeanShift and Dirichlet are not so
obviously incorrect but I remain suspicious. Something must have
become broken in the recent clustering refactoring.
From the method CDbwEvaluator.invalidCluster comment (used to enable
pruning):
 * Return if the cluster is valid. Valid clusters must have more than 2 representative points,
 * and at least one of them must be different than the cluster center. This is because the
 * representative points extraction will duplicate the cluster center if it is empty.
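A minimal sketch of the rule as that comment words it, written in Python for brevity rather than Mahout's Java. It follows only the comment text, not the actual CDbwEvaluator code, and the names `is_valid_cluster` and `prune` are hypothetical. It may help when checking by hand why a given cluster was dropped.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def is_valid_cluster(center: Vector, rep_points: Sequence[Vector]) -> bool:
    """Apply the rule as worded in the quoted comment (an assumption, not Mahout code).

    A cluster is valid when it has more than 2 representative points and
    at least one of them differs from the cluster center.
    """
    if len(rep_points) <= 2:
        return False
    # Exact comparison: points that merely duplicate the center are identical to it.
    return any(tuple(p) != tuple(center) for p in rep_points)


def prune(clusters: Dict[int, Tuple[Vector, Sequence[Vector]]]):
    """Keep only valid clusters. `clusters` maps id -> (center, representative points)."""
    return {
        cid: (center, pts)
        for cid, (center, pts) in clusters.items()
        if is_valid_cluster(center, pts)
    }
```

Under this reading, a cluster whose representative points are all copies of its center is pruned, which is what the comment says happens for an empty cluster.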
Oddly enough, inspection of the test log indicates that only k-means
and fuzzy-k are not pruning clusters. Clearly some more investigation
is needed. I will take a look at it tomorrow. In the meantime, if you
develop any additional insight please do share it with us.
Thanks,
Jeff
On 5/17/12 3:53 PM, Pat Ferrel wrote:
I built a tool that iterates through a list of values for k on the
same data and spits out the CDbw and ClusterEvaluator results each time.
When the evaluator or CDbw prunes a cluster, how do I interpret that?
They seem to throw out the same clusters on a given run. Also, why does
CDbw always return an inter-cluster density of 0?
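A sweep tool of the kind described above might look roughly like the following Python sketch. The function `sweep_k` and its two callables are hypothetical stand-ins: `cluster_fn(data, k)` is assumed to return a list of clusters, and `evaluate_fn(clusters)` is assumed to return a metrics dict plus the indices of pruned clusters. Neither corresponds to a real Mahout API.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple


def sweep_k(
    data: Sequence[Any],
    ks: Iterable[int],
    cluster_fn: Callable[[Sequence[Any], int], List[Any]],
    evaluate_fn: Callable[[List[Any]], Tuple[Dict[str, float], List[int]]],
) -> List[Dict[str, Any]]:
    """Cluster the same data once per k and collect the evaluation of each run."""
    results = []
    for k in ks:
        clusters = cluster_fn(data, k)
        metrics, pruned = evaluate_fn(clusters)
        row = {"k": k, "clusters": len(clusters), "pruned": sorted(pruned)}
        row.update(metrics)  # e.g. inter- and intra-cluster values
        results.append(row)
    return results
```

Recording the pruned cluster indices next to the metrics for each k makes it easy to see whether the same clusters are dropped on repeated runs.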
On 5/17/12 5:58 AM, Jeff Eastman wrote:
Yes, that is the paper I used to implement CDbw. I've tried it a few
times along with the simpler ClusterEvaluator metrics I took from
Mahout In Action and they look to be reasonable - see the tests -
though I have no way to judge their absolute values. Anything you
can contribute in this area would be most welcome. Perhaps a wiki page?
On 5/16/12 1:14 PM, Pat Ferrel wrote:
The reference was in the code for
http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
On 5/16/12 9:56 AM, Pat Ferrel wrote:
Thanks, I've been looking at that. Is there a description of how
to interpret those values? An academic paper maybe? The
intra-cluster distance intuitively seems to correspond to
something like cohesion. I don't get the intuition behind
inter-cluster distances but Ted thinks they are the most important.
On 5/16/12 7:32 AM, Jeff Eastman wrote:
Mahout has a ClusterEvaluator and a CDbwEvaluator that compute
some quality metrics (inter-cluster distance,
intra-cluster distance, ...) that you may find useful. Both
calculate a set of representative points from the clustering
output and compute the (n^2) metrics over these points rather
than all of the points in each cluster.
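To illustrate the idea of computing pairwise metrics over a few representative points instead of every member, here is a small Python sketch. The definitions used (mean pairwise Euclidean distance within a cluster, and mean distance between points of different clusters) are simple assumptions chosen for illustration. They are not the formulas used by ClusterEvaluator or CDbwEvaluator, and the function names are hypothetical.

```python
from itertools import combinations, product
from math import dist
from typing import Sequence

Point = Sequence[float]


def intra_cluster_distance(rep_points: Sequence[Point]) -> float:
    """Mean pairwise distance among one cluster's representative points."""
    pairs = list(combinations(rep_points, 2))
    if not pairs:
        return 0.0  # fewer than two points: no pair to measure
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def inter_cluster_distance(clusters: Sequence[Sequence[Point]]) -> float:
    """Mean distance between representative points drawn from different clusters."""
    total, count = 0.0, 0
    for pts_a, pts_b in combinations(clusters, 2):
        for a, b in product(pts_a, pts_b):
            total += dist(a, b)
            count += 1
    return total / count if count else 0.0
```

With r representative points per cluster, the pairwise work depends on r rather than on the cluster sizes, which is the saving described above.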
On 5/15/12 4:46 PM, Pat Ferrel wrote:
So many questions (the best k, how to choose t1 and t2, how much
dimensional reduction helps) would have clear answers if
we had a way to judge the quality of clusters.
Various methods were discussed here for a time:
http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
Has there been any work on building a measure of quality?