Re: Judging the quality of clustering

Pat Ferrel Fri, 18 May 2012 09:27:30 -0700

Thanks Jeff. My experiment used k-means for three runs with k = 10, 20, 30. The number of documents was around 3000 (guessing here).

The k=10 run did not prune, k=30 pruned 4 clusters. I'll run this again to see if it is repeatable and you are welcome to the dataset.

I read that comment but was confused about the representative points. They appear to be collected by the RepresentativePointsDriver. The only input that looks relevant is an iteration number. I'll try increasing that to see if the points are better chosen, I guess? Basically pruned clusters indicate that they are not part of the analysis, and I should do something to remedy the pruning.

I'd really like to get this working, so if you have any suggestions for what to look at I'll give it a try. I have a tiny data set (16 small docs) I could use where you could probably calculate the CDbw by hand. k = 1, 2 maybe.


I'll poke around and see what I can find.

On 5/17/12 2:33 PM, Jeff Eastman wrote:

Hi Pat,
I don't have a good answer here. Evidently, something in CDbw has become broken and you are the first to notice. When I run TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly incorrect. The values for Canopy, MeanShift and Dirichlet are not so obviously incorrect but I remain suspicious. Something must have become broken in the recent clustering refactoring.
From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):
* Return if the cluster is valid. Valid clusters must have more than 2 representative points,
* and at least one of them must be different than the cluster center. This is because the
* representative points extraction will duplicate the cluster center if it is empty.
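The rule stated in that comment can be sketched in a few lines. This is only an illustration in Python, not the Mahout code: the function name `is_valid_cluster`, the tuple representation of points, and the use of exact equality to compare a point with the center are all assumptions made for the example.

```python
from typing import Sequence

Point = Sequence[float]


def is_valid_cluster(center: Point, representative_points: Sequence[Point]) -> bool:
    """Hypothetical check mirroring the comment quoted above.

    A cluster is treated as valid when it has more than 2 representative
    points and at least one of them differs from the cluster center.
    Exact equality is assumed here; a real implementation might compare
    with a tolerance instead.
    """
    if len(representative_points) <= 2:
        return False
    return any(tuple(p) != tuple(center) for p in representative_points)
```

Under this reading, a cluster whose representative points are all copies of its center (the duplication case the comment mentions) would be pruned, as would any cluster with 2 or fewer representative points.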
Oddly enough, inspection of the test log indicates that only k-means and fuzzy-k are not pruning clusters. Clearly some more investigation is needed. I will take a look at it tomorrow. In the meantime, if you develop any additional insight please do share it with us.
Thanks,
Jeff

On 5/17/12 3:53 PM, Pat Ferrel wrote:
I built a tool that iterates through a list of values for k on the same data and spits out the CDbw and ClusterEvaluator results each time.
When the evaluator or CDbw prunes a cluster, how do I interpret that? They seem to throw out the same clusters on a given run. Also, CDbw always returns an inter-cluster density of 0?
On 5/17/12 5:58 AM, Jeff Eastman wrote:
Yes, that is the paper I used to implement CDbw. I've tried it a few times along with the simpler ClusterEvaluator metrics I took from Mahout In Action and they look to be reasonable - see the tests - though I have no way to judge their absolute values. Anything you can contribute in this area would be most welcome. Perhaps a wiki page?
On 5/16/12 1:14 PM, Pat Ferrel wrote:
The reference was in the code for http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
On 5/16/12 9:56 AM, Pat Ferrel wrote:
Thanks, I've been looking at that. Is there a description of how to interpret those values? An academic paper maybe? The intra-cluster distance intuitively seems to correspond to something like cohesion. I don't get the intuition behind inter-cluster distances but Ted thinks they are the most important.
On 5/16/12 7:32 AM, Jeff Eastman wrote:
Mahout has a ClusterEvaluator and a CDbwEvaluator that compute some quality metrics (inter-cluster distance, intra-cluster distance, ...) that you may find useful. Both calculate a set of representative points from the clustering output and compute the (n^2) metrics over these points rather than all of the points in each cluster.
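To make the idea of pairwise metrics over representative points concrete, here is a simplified Python sketch. It is not the formula used by ClusterEvaluator or CDbwEvaluator: the function names, the Euclidean distance, and the plain averaging are assumptions chosen for illustration. Each cluster is given as a list of its representative points.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def intra_cluster_distance(clusters):
    """Mean pairwise distance between representative points of the same cluster."""
    distances = [
        dist(p, q)
        for points in clusters
        for p, q in combinations(points, 2)
    ]
    # Raises statistics.StatisticsError if no cluster has at least 2 points.
    return mean(distances)


def inter_cluster_distance(clusters):
    """Mean pairwise distance between representative points of different clusters."""
    distances = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    ]
    # Raises statistics.StatisticsError if there are fewer than 2 non-empty clusters.
    return mean(distances)


# Two tight, well-separated clusters with two representative points each.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
intra = intra_cluster_distance(clusters)
inter = inter_cluster_distance(clusters)
```

With a handful of representative points per cluster, both loops stay quadratic in the number of representative points rather than in the size of the full data set. In this toy example the intra-cluster value is small and the inter-cluster value is large, which is the pattern one would hope to see for compact, well-separated clusters.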
On 5/15/12 4:46 PM, Pat Ferrel wrote:
So many questions about the best k, how to choose t1 and t2, and how much dimensionality reduction helps would have clear answers if we had a way to judge the quality of clusters.
Various methods were discussed here for a time: http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
Has there been any work on building a measure of quality?
