I'm only on 0.6, nothing very recent.
On May 17, 2012, at 2:33 PM, Jeff Eastman <[email protected]> wrote:
> Hi Pat,
>
> I don't have a good answer here. Evidently, something in CDbw has become
> broken and you are the first to notice. When I run TestCDbwEvaluator, the
> values for k-means and fuzzy-k are clearly incorrect. The values for Canopy,
> MeanShift and Dirichlet are not so obviously incorrect but I remain
> suspicious. Something must have become broken in the recent clustering
> refactoring.
>
> From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):
> * Return if the cluster is valid. Valid clusters must have more than 2 representative points,
> * and at least one of them must be different than the cluster center. This is because the
> * representative points extraction will duplicate the cluster center if it is empty.
>
> Oddly enough, inspection of the test log indicates that only k-means and
> fuzzy-k are not pruning clusters. Clearly some more investigation is needed.
> I will take a look at it tomorrow. In the mean time if you develop any
> additional insight please do share it with us.
>
> Thanks,
> Jeff
>
> On 5/17/12 3:53 PM, Pat Ferrel wrote:
>> I built a tool that iterates through a list of values for k on the same data
>> and spits out the CDbw and ClusterEvaluator results each time.
>>
>> When the evaluator or CDbw prunes a cluster, how do I interpret that? They
>> seem to throw out the same clusters on a given run. Also CDbw always returns
>> an inter-cluster density of 0?
>>
>> On 5/17/12 5:58 AM, Jeff Eastman wrote:
>>> Yes, that is the paper I used to implement CDbw. I've tried it a few times
>>> along with the simpler ClusterEvaluator metrics I took from Mahout In
>>> Action and they look to be reasonable - see the tests - though I have no
>>> way to judge their absolute values. Anything you can contribute in this
>>> area would be most welcome. Perhaps a wiki page?
>>>
>>> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>>>> The reference was in the code for
>>>> http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>>
>>>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>>>> Thanks, I've been looking at that. Is there a description of how to
>>>>> interpret those values? An academic paper maybe? The intra-cluster
>>>>> distance intuitively seems to correspond to something like cohesion. I
>>>>> don't get the intuition behind inter-cluster distances but Ted thinks
>>>>> they are the most important.
>>>>>
>>>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute some
>>>>>> quality metrics (inter-cluster distance, intra-cluster-distance, ...)
>>>>>> that you may find useful. Both calculate a set of representative points
>>>>>> from the clustering output and compute the (n^2) metrics over these
>>>>>> points rather than all of the points in each cluster.
>>>>>>
>>>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>>>> So many questions about best k, how to choose t1 and t2, how much help
>>>>>>> is dimensional reduction would have clear answers if we had a way to
>>>>>>> judge the quality of clusters.
>>>>>>>
>>>>>>> Various methods were discussed here for a time:
>>>>>>> http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
>>>>>>>
>>>>>>> Has there been any work on building a measure of quality?
