The average distance to the nearest cluster measures overall clumpiness at a particular scale but does not address the cohesiveness of any particular clump. In any real-world data set some clusters will be cohesive and some will not. This happens for at least two reasons: some data does not clump, and clumpiness exists at multiple scales. I believe this is an important distinction, and it implies the need for a per-cluster cohesiveness evaluation.
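
As a rough illustration of what a per-cluster cohesiveness measure could look like, here is a minimal sketch in plain Python. It is not what ClusterEvaluator or CDbw computes. The function names, the choice of "mean cosine distance from members to their own centroid" as the score, and the handling of zero vectors are all assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as maximally distant rather
        # than letting the division produce an undefined result.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def per_cluster_cohesion(vectors, labels):
    """Map each cluster label to the mean cosine distance between the
    cluster's members and its centroid. Lower means more cohesive."""
    members = defaultdict(list)
    for vector, label in zip(vectors, labels):
        members[label].append(vector)
    scores = {}
    for label, cluster in members.items():
        center = centroid(cluster)
        total = sum(cosine_distance(v, center) for v in cluster)
        scores[label] = total / len(cluster)
    return scores


# Hypothetical toy data: cluster 0 is tight, cluster 1 is loose.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
scores = per_cluster_cohesion(vectors, labels)
for label in sorted(scores):
    print(label, round(scores[label], 4))
```

Reporting one number per cluster, rather than a single average, is what lets a loose cluster stand out next to a tight one at the same k.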

It was my understanding that the ClusterEvaluator included an attempt to provide this measure with intra-cluster density per cluster though it looks like that output has been removed?

On 7/8/12 6:07 PM, Ted Dunning wrote:
I can't comment on the existing evaluators, but for me the only real measure that I care about is average distance to nearest cluster for new or held-out data. I will be building something of this sort for the clustering part of the knn code I have been working on.
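
For concreteness, the held-out measure described above might be sketched as follows. This is only an illustration under stated assumptions, not the code being built for the knn work: the function name, the default Euclidean distance (`math.dist`, Python 3.8+), and the toy data are all hypothetical.

```python
import math


def mean_distance_to_nearest(held_out, centroids, distance=math.dist):
    """Average, over held-out points, of the distance from each point
    to its nearest centroid. Lower values mean the clustering fits
    unseen data more closely.

    `distance` is any function of two vectors; Euclidean distance is
    an assumed default and can be swapped for another measure.
    """
    if not held_out or not centroids:
        raise ValueError("need at least one held-out point and one centroid")
    total = 0.0
    for point in held_out:
        total += min(distance(point, c) for c in centroids)
    return total / len(held_out)


# Hypothetical example: two centroids and three held-out points.
centroids = [[0.0, 0.0], [10.0, 10.0]]
held_out = [[1.0, 0.0], [10.0, 9.0], [0.0, 2.0]]
print(mean_distance_to_nearest(held_out, centroids))
```

Computing this for the centroids produced at each candidate k gives one number per k to compare, provided the held-out points were not used to fit the clusters.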

On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel wrote:

    To use something like kmeans on any large and changing data set it
    seems a requirement that there be some means of evaluating the
    quality of clusters at different scales. The usual eyeballing
    breaks down quickly.

    Trying to use the cluster evaluators in Mahout with kmeans as the
    clustering method and cosine as the distance measure has proven
    problematic. The method is to iterate through the data using
    different values of k, performing the evaluation at each one. What
    I find is that certain values are almost always in error. The
    intra-cluster density from ClusterEvaluator is almost always NaN.
    The CDbw inter-cluster density is almost always 0. I have also
    seen several cases where CDbw fails to return any results but have
    not tracked down why yet.

    Given that the output of either evaluator is usually incomplete,
    these methods are not very useful. Is Mahout dropping the
    evaluators? Is the general wisdom that they are not particularly
    useful? Should a newer method be pursued? This seems a fairly
    important question to me. Am I missing something?

    Raw data for a sample crawl is given below:





