Re: Judging the quality of clustering

Pat Ferrel Thu, 17 May 2012 16:07:55 -0700

I'm only on 0.6, nothing very recent.


On May 17, 2012, at 2:33 PM, Jeff Eastman <[email protected]> wrote:

> Hi Pat,
> 
> I don't have a good answer here. Evidently, something in CDbw has become 
> broken and you are the first to notice. When I run TestCDbwEvaluator, the 
> values for k-means and fuzzy-k are clearly incorrect. The values for Canopy, 
> MeanShift and Dirichlet are not so obviously incorrect but I remain 
> suspicious. Something must have become broken in the recent clustering 
> refactoring.
> 
> From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):
>   * Return if the cluster is valid. Valid clusters must have more than 2
>   * representative points, and at least one of them must be different than
>   * the cluster center. This is because the representative points extraction
>   * will duplicate the cluster center if it is empty.
> 
> Oddly enough, inspection of the test log indicates that only k-means and 
> fuzzy-k are not pruning clusters. Clearly some more investigation is needed. 
> I will take a look at it tomorrow. In the meantime, if you develop any 
> additional insight please do share it with us.
> 
> Thanks,
> Jeff
> 
> On 5/17/12 3:53 PM, Pat Ferrel wrote:
>> I built a tool that iterates through a list of values for k on the same data 
>> and spits out the CDbw and ClusterEvaluator results each time.
>> 
>> When the evaluator or CDbw prunes a cluster, how do I interpret that? They 
>> seem to throw out the same clusters on a given run. Also CDbw always returns 
>> an inter-cluster density of 0?
>> 
>> On 5/17/12 5:58 AM, Jeff Eastman wrote:
>>> Yes, that is the paper I used to implement CDbw. I've tried it a few times 
>>> along with the simpler ClusterEvaluator metrics I took from Mahout In 
>>> Action and they look to be reasonable - see the tests - though I have no 
>>> way to judge their absolute values. Anything you can contribute in this 
>>> area would be most welcome. Perhaps a wiki page?
>>> 
>>> 
>>> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>>>> The reference was in the code for 
>>>> http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>> 
>>>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>>>> Thanks, I've been looking at that. Is there a description of how to 
>>>>> interpret those values? An academic paper maybe? The intra-cluster 
>>>>> distance intuitively seems to correspond to something like cohesion. I 
>>>>> don't get the intuition behind inter-cluster distances but Ted thinks 
>>>>> they are the most important.
>>>>> 
>>>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute some 
>>>>>> quality metrics (inter-cluster distance, intra-cluster distance, ...) 
>>>>>> that you may find useful. Both calculate a set of representative points 
>>>>>> from the clustering output and compute the O(n^2) metrics over these 
>>>>>> points rather than over all of the points in each cluster.
>>>>>> 
>>>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>>>> So many questions about the best k, how to choose t1 and t2, and how 
>>>>>>> much dimensional reduction helps would have clear answers if we had a 
>>>>>>> way to judge the quality of clusters.
>>>>>>> 
>>>>>>> Various methods were discussed here for a time: 
>>>>>>> http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
>>>>>>> 
>>>>>>> Has there been any work on building a measure of quality?
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>
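
For readers following the pruning discussion, the validity rule quoted from the CDbwEvaluator.invalidCluster comment can be sketched in a few lines. This is a minimal illustration that follows the wording of the quoted comment literally; it is not Mahout's code. The names is_valid_cluster and prune_invalid, and the dict layout of the input, are hypothetical and chosen only for this example.

```python
def is_valid_cluster(center, rep_points):
    """Apply the rule from the quoted comment: more than 2 representative
    points, and at least one of them different from the cluster center."""
    if len(rep_points) <= 2:
        return False
    # Exact comparison is an assumption here; real data may need a tolerance.
    return any(tuple(p) != tuple(center) for p in rep_points)


def prune_invalid(clusters):
    """clusters maps a cluster id to (center, representative_points).
    Returns only the clusters that pass the validity rule."""
    return {
        cid: (center, reps)
        for cid, (center, reps) in clusters.items()
        if is_valid_cluster(center, reps)
    }


clusters = {
    # Normal cluster: three representative points, two differ from the center.
    "a": ((0.0, 0.0), [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]),
    # Empty cluster: extraction only duplicated the center.
    "b": ((5.0, 5.0), [(5.0, 5.0), (5.0, 5.0), (5.0, 5.0)]),
    # Too few representative points.
    "c": ((9.0, 9.0), [(9.0, 9.0), (9.5, 9.0)]),
}
kept = prune_invalid(clusters)
```

Under this reading, a cluster is pruned either because it has too few representative points or because every representative point is just a copy of its center, which is what happens to an empty cluster.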

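The idea in the oldest reply, computing quality metrics over a few representative points per cluster instead of over every point, can also be sketched. The code below is a simplified illustration under stated assumptions, not the ClusterEvaluator or CDbwEvaluator implementation: representative points are picked by starting at the centroid and repeatedly adding the point farthest from those already chosen, intra-cluster distance is the mean pairwise distance among a cluster's representative points, and inter-cluster distance is the mean pairwise distance between centroids. All function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def representative_points(points, num_reps):
    """Start with the centroid, then repeatedly add the point that is
    farthest from the representatives chosen so far."""
    reps = [centroid(points)]
    candidates = list(points)
    while len(reps) < num_reps and candidates:
        farthest = max(
            candidates, key=lambda p: min(euclidean(p, r) for r in reps)
        )
        candidates.remove(farthest)
        reps.append(farthest)
    return reps


def mean_pairwise_distance(points):
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def intra_cluster_distance(clusters, num_reps=3):
    """Average, over clusters, of the mean pairwise distance among each
    cluster's representative points (smaller suggests tighter clusters)."""
    per_cluster = [
        mean_pairwise_distance(representative_points(c, num_reps))
        for c in clusters
    ]
    return sum(per_cluster) / len(per_cluster)


def inter_cluster_distance(clusters):
    """Mean pairwise distance between cluster centroids (larger suggests
    better separated clusters)."""
    return mean_pairwise_distance([centroid(c) for c in clusters])


clusters = [
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
]
intra = intra_cluster_distance(clusters)
inter = inter_cluster_distance(clusters)
```

Because only num_reps points per cluster enter the pairwise loops, the quadratic cost applies to the representative points rather than to the full data set. On the two well-separated squares above, the inter-cluster distance comes out much larger than the intra-cluster distance, which is the pattern one would hope to see for a good clustering.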