Re: Judging the quality of clustering

Jeff Eastman Thu, 17 May 2012 14:33:44 -0700

Hi Pat,

I don't have a good answer here. Evidently, something in CDbw has become broken and you are the first to notice. When I run TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly incorrect. The values for Canopy, MeanShift and Dirichlet are not so obviously incorrect but I remain suspicious. Something must have become broken in the recent clustering refactoring.

From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):

   * Return if the cluster is valid. Valid clusters must have more than 2 representative points,
   * and at least one of them must be different than the cluster center. This is because the
   * representative points extraction will duplicate the cluster center if it is empty.
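
The validity rule quoted in that comment can be sketched roughly as below. This is only an illustration of the rule as worded, not the actual CDbwEvaluator code: the class name ClusterValiditySketch, the method isValid, and the use of plain double[] arrays for points are all assumptions made for the example, and "more than 2" is taken literally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the rule in the quoted comment; not Mahout's implementation.
final class ClusterValiditySketch {

  // Returns true when the cluster would be kept (not pruned) under the quoted rule:
  // more than 2 representative points, at least one of which differs from the center.
  static boolean isValid(double[] center, List<double[]> representativePoints) {
    if (representativePoints.size() <= 2) {
      return false; // too few representative points
    }
    for (double[] point : representativePoints) {
      if (!Arrays.equals(point, center)) {
        return true; // found a point that is not just a copy of the center
      }
    }
    return false; // every representative point duplicates the center
  }
}
```

Under this reading, a cluster whose representative points are all copies of its center (the case the comment attributes to an empty cluster) is the one that gets pruned.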

Oddly enough, inspection of the test log indicates that only k-means and fuzzy-k are not pruning clusters. Clearly some more investigation is needed. I will take a look at it tomorrow. In the meantime, if you develop any additional insight please do share it with us.


Thanks,
Jeff

On 5/17/12 3:53 PM, Pat Ferrel wrote:

I built a tool that iterates through a list of values for k on the same data and spits out the CDbw and ClusterEvaluator results each time.
When the evaluator or CDbw prunes a cluster, how do I interpret that? They seem to throw out the same clusters on a given run. Also CDbw always returns an inter-cluster density of 0?
On 5/17/12 5:58 AM, Jeff Eastman wrote:
Yes, that is the paper I used to implement CDbw. I've tried it a few times along with the simpler ClusterEvaluator metrics I took from Mahout In Action and they look to be reasonable - see the tests - though I have no way to judge their absolute values. Anything you can contribute in this area would be most welcome. Perhaps a wiki page?
On 5/16/12 1:14 PM, Pat Ferrel wrote:
The reference was in the code for http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
On 5/16/12 9:56 AM, Pat Ferrel wrote:
Thanks, I've been looking at that. Is there a description of how to interpret those values? An academic paper maybe? The intra-cluster distance intuitively seems to correspond to something like cohesion. I don't get the intuition behind inter-cluster distances but Ted thinks they are the most important.
On 5/16/12 7:32 AM, Jeff Eastman wrote:
Mahout has a ClusterEvaluator and a CDbwEvaluator that compute some quality metrics (inter-cluster distance, intra-cluster distance, ...) that you may find useful. Both calculate a set of representative points from the clustering output and compute the (n^2) metrics over these points rather than all of the points in each cluster.
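
As a rough illustration of pairwise (n^2) metrics computed over representative points rather than over every point, here is a minimal sketch. It assumes Euclidean distance, defines intra-cluster distance as the mean pairwise distance among a cluster's representative points (averaged over clusters), and defines inter-cluster distance as the mean pairwise distance between cluster centers. Those definitions, and the names RepresentativePointMetricsSketch, intraClusterDistance and interClusterDistance, are assumptions for the example and are not taken from ClusterEvaluator or CDbwEvaluator.

```java
import java.util.List;

// Hypothetical sketch; the formulas here are simple means, not Mahout's exact metrics.
final class RepresentativePointMetricsSketch {

  // Euclidean distance between two points of equal dimension.
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Mean distance over all unordered pairs of points; 0 when there are fewer than 2 points.
  static double meanPairwiseDistance(List<double[]> points) {
    int n = points.size();
    if (n < 2) {
      return 0.0;
    }
    double total = 0.0;
    int pairs = 0;
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        total += euclidean(points.get(i), points.get(j));
        pairs++;
      }
    }
    return total / pairs;
  }

  // Average, over clusters, of the mean pairwise distance among that
  // cluster's representative points (smaller suggests tighter clusters).
  static double intraClusterDistance(List<List<double[]>> representativePointsPerCluster) {
    if (representativePointsPerCluster.isEmpty()) {
      return 0.0;
    }
    double total = 0.0;
    for (List<double[]> points : representativePointsPerCluster) {
      total += meanPairwiseDistance(points);
    }
    return total / representativePointsPerCluster.size();
  }

  // Mean pairwise distance between cluster centers (larger suggests better separation).
  static double interClusterDistance(List<double[]> centers) {
    return meanPairwiseDistance(centers);
  }
}
```

Because only a handful of representative points per cluster enter the double loop, the quadratic cost stays small even when the clusters themselves are large.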
On 5/15/12 4:46 PM, Pat Ferrel wrote:
So many questions about the best k, how to choose t1 and t2, and how much help dimensional reduction is would have clear answers if we had a way to judge the quality of clusters.
Various methods were discussed here for a time: http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
Has there been any work on building a measure of quality?
