Re: Cluster Evaluation 0.8 style

Pat Ferrel Wed, 11 Jul 2012 10:21:56 -0700

The average distance to the nearest cluster measures overall clumpiness found at a particular scale but does not address the cohesiveness of any particular clump. In any real-world data set some clusters will be cohesive and some not. This happens for at least two reasons: some data does not clump, and there are multiple scales for clumpiness. This is an important distinction, I believe, and implies the need for a per-cluster cohesiveness evaluation.
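To make the distinction concrete, here is a minimal sketch of a per-cluster cohesiveness score, assuming cosine distance and taking cohesion as the mean distance from each point to its own cluster centroid. The function names and the toy data are hypothetical and are not taken from Mahout.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def per_cluster_cohesion(clusters):
    """Map cluster id -> mean cosine distance from its points to its centroid."""
    result = {}
    for cid, points in clusters.items():
        c = centroid(points)
        result[cid] = sum(cosine_distance(p, c) for p in points) / len(points)
    return result

# Hypothetical toy data: one tight clump and one spread-out clump.
clusters = {
    "tight": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    "loose": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
cohesion = per_cluster_cohesion(clusters)

# A single overall average sits between the two and hides the difference.
overall = sum(cohesion.values()) / len(cohesion)
```

Lower values mean a more cohesive cluster. Reporting the whole `cohesion` mapping rather than only `overall` is what lets a tight cluster be told apart from a loose one at the same k.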

It was my understanding that the ClusterEvaluator included an attempt to provide this measure with intra-cluster density per cluster, though it looks like that output has been removed?


On 7/8/12 6:07 PM, Ted Dunning wrote:

I can't comment on the existing evaluators, but for me the only real measure that I care about is average distance to nearest cluster for new or held-out data. I will be building something of this sort for the clustering part of the knn code I have been working on.
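The measure described here can be sketched in a few lines, assuming cosine distance and clusters represented by their centroids. This is only an illustration with hypothetical names and toy data, not the knn code mentioned above.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_distance_to_nearest(centroids, held_out):
    """Average, over held-out points, of the distance to the closest centroid."""
    total = 0.0
    for point in held_out:
        total += min(cosine_distance(point, c) for c in centroids)
    return total / len(held_out)

# Hypothetical centroids from a k=2 run and three points kept out of training.
centroids = [[1.0, 0.0], [0.0, 1.0]]
held_out = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
score = mean_distance_to_nearest(centroids, held_out)

# Adding a centroid that covers the third point lowers the score.
score_k3 = mean_distance_to_nearest(centroids + [[1.0, 1.0]], held_out)
```

Lower is better. Because the score usually shrinks as k grows, the shape of the curve across several k values on the same held-out set is more informative than any single number.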

On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel wrote:


    To use something like kmeans on any large and changing data set it
    seems a requirement that there be some means of evaluating the
    quality of clusters at different scales. The usual eyeballing
    breaks down quickly.

    Trying to use the cluster evaluators in Mahout with kmeans as the
    clustering method and cosine as the distance measure has proven
    problematic. The method is to iterate through the data using
    different ks and performing the evaluation at each point. What I
    find is that certain values are almost always in error. The
    Intra-cluster density from ClusterEvaluator is almost always NaN.
    The CDbw inter-cluster density is almost always 0. I have also
    seen several cases where CDbw fails to return any results but have
    not tracked down why yet.
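The thread does not establish why these values come out as NaN or 0. As a purely illustrative sketch, and not Mahout's ClusterEvaluator code, any measure built on a mean over point pairs is undefined for a cluster with fewer than two points. The hypothetical helper below returns None in that case, so undefined clusters are reported explicitly instead of turning the whole result into NaN.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_pairwise_distance(points):
    """Mean cosine distance over all point pairs, or None when there are no pairs."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return None  # fewer than two points: the average is undefined
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clustering with one singleton cluster.
clusters = {0: [[1.0, 0.0], [0.8, 0.2]], 1: [[0.0, 1.0]]}
report = {cid: mean_pairwise_distance(pts) for cid, pts in clusters.items()}

# Aggregate only over clusters where the measure is defined.
defined = [v for v in report.values() if v is not None]
```

Checking how many clusters end up with a single point at each k is a cheap first diagnostic when an evaluator returns incomplete results.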

    Given that the output of either evaluator is usually incomplete,
    these methods are not very useful. Is Mahout dropping the
    evaluators? Is the general wisdom that they are not particularly
    useful? Should a newer method be pursued? This seems a fairly
    important question to me. Am I missing something?

    Raw data for a sample crawl is given below:
