Re: CDbw and Evaluator results

Jeff Eastman Wed, 23 May 2012 06:28:21 -0700

Can you try this again using trunk? If there is no improvement, I think a JIRA to investigate would be useful.


On 5/22/12 2:02 PM, Pat Ferrel wrote:

I'm using mahout 0.6 and so may not be seeing the same results as you.
I take it that the inter-cluster distance of 0 is a bug and pruning should not happen very often?
I haven't used this before so I'm not sure if my CDbw or Evaluator results are wrong in other ways.
Should I create a bug for this in Jira?

On 5/17/12 2:33 PM, Jeff Eastman wrote:
Hi Pat,
I don't have a good answer here. Evidently, something in CDbw has become broken and you are the first to notice. When I run TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly incorrect. The values for Canopy, MeanShift and Dirichlet are not so obviously incorrect but I remain suspicious. Something must have become broken in the recent clustering refactoring.
From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):
 * Return if the cluster is valid. Valid clusters must have more than 2 representative points,
 * and at least one of them must be different than the cluster center. This is because the
 * representative points extraction will duplicate the cluster center if it is empty.
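The rule stated in that comment can be restated as a small check. This is only a sketch in Python of what the comment says, not Mahout's Java implementation; the function name `is_valid_cluster` and its arguments are hypothetical.

```python
from typing import Sequence


def is_valid_cluster(center: Sequence[float],
                     rep_points: Sequence[Sequence[float]]) -> bool:
    """Hypothetical restatement of the quoted comment, not Mahout's code.

    A cluster is treated as valid when it has more than 2 representative
    points and at least one of them differs from the cluster center.
    """
    if len(rep_points) <= 2:
        return False
    center_key = tuple(center)
    # An empty cluster would only contain copies of its own center.
    return any(tuple(p) != center_key for p in rep_points)
```

Under this reading, a cluster whose representative points are all copies of its center would be pruned, as would any cluster with two or fewer representative points.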
Oddly enough, inspection of the test log indicates that only k-means and fuzzy-k are not pruning clusters. Clearly some more investigation is needed. I will take a look at it tomorrow. In the meantime, if you develop any additional insight please do share it with us.
Thanks,
Jeff

On 5/17/12 3:53 PM, Pat Ferrel wrote:
I built a tool that iterates through a list of values for k on the same data and spits out the CDbw and ClusterEvaluator results each time.
When the evaluator or CDbw prunes a cluster, how do I interpret that? They seem to throw out the same clusters on a given run. Also, CDbw always returns an inter-cluster density of 0?
On 5/17/12 5:58 AM, Jeff Eastman wrote:
Yes, that is the paper I used to implement CDbw. I've tried it a few times along with the simpler ClusterEvaluator metrics I took from Mahout In Action and they look to be reasonable - see the tests - though I have no way to judge their absolute values. Anything you can contribute in this area would be most welcome. Perhaps a wiki page?
On 5/16/12 1:14 PM, Pat Ferrel wrote:
The reference was in the code for http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
On 5/16/12 9:56 AM, Pat Ferrel wrote:
Thanks, I've been looking at that. Is there a description of how to interpret those values? An academic paper maybe? The intra-cluster distance intuitively seems to correspond to something like cohesion. I don't get the intuition behind inter-cluster distances but Ted thinks they are the most important.
On 5/16/12 7:32 AM, Jeff Eastman wrote:
Mahout has a ClusterEvaluator and a CDbwEvaluator that compute some quality metrics (inter-cluster distance, intra-cluster distance, ...) that you may find useful. Both calculate a set of representative points from the clustering output and compute the (n^2) metrics over these points rather than all of the points in each cluster.
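To make the idea concrete, here is a simplified Python sketch of pairwise distance metrics computed over representative points only. It is an illustration under stated assumptions: plain Euclidean distance, simple averaging, and hypothetical function names. The exact formulas in ClusterEvaluator and CDbwEvaluator may differ.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_distance(rep_points):
    """Average pairwise distance among one cluster's representative points."""
    pairs = list(combinations(rep_points, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def mean_inter_distance(clusters):
    """Average distance between representative points of different clusters."""
    total, count = 0.0, 0
    for c1, c2 in combinations(clusters, 2):
        for a in c1:
            for b in c2:
                total += dist(a, b)
                count += 1
    return total / count if count else 0.0


# Two small, well-separated clusters given by their representative points.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
intra = [mean_intra_distance(c) for c in clusters]
inter = mean_inter_distance(clusters)
```

Because only a handful of representative points per cluster enter the double loops, the quadratic cost stays small even when the clusters themselves are large.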
On 5/15/12 4:46 PM, Pat Ferrel wrote:
So many questions about the best k, how to choose t1 and t2, and how much dimensional reduction helps would have clear answers if we had a way to judge the quality of clusters.
Various methods were discussed here for a time: http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
Has there been any work on building a measure of quality?
