The average distance to the nearest cluster measures overall clumpiness
found at a particular scale but does not address the cohesiveness of any
particular clump. In any real-world data set some clusters will be
cohesive and some not. This happens for at least two reasons: some data
does not clump, and clumpiness exists at multiple scales. I believe this
is an important distinction, and it implies the need for a per-cluster
cohesiveness evaluation.
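
To make the idea concrete, here is a minimal sketch of a per-cluster
cohesiveness score in plain Python. It is not Mahout code: the function
names, the choice of mean cosine distance to the centroid as the score,
and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Return 1 - cosine similarity, or 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def per_cluster_cohesion(points, assignments, centroids):
    """Mean cosine distance from each cluster's members to its centroid.

    Returns {cluster_id: mean distance}; lower means more cohesive.
    Clusters with no members are omitted rather than reported as NaN.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for point, cid in zip(points, assignments):
        totals[cid] += cosine_distance(point, centroids[cid])
        counts[cid] += 1
    return {cid: totals[cid] / counts[cid] for cid in counts}


# Hypothetical toy data: cluster 0 is tight, cluster 1 is spread out.
points = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (1.0, 1.0)]
assignments = [0, 0, 1, 1]
centroids = {0: (1.0, 0.05), 1: (0.5, 1.0)}
cohesion = per_cluster_cohesion(points, assignments, centroids)
```

Reporting one number per cluster, rather than a single average, is what
lets a loose cluster show up next to a tight one at the same k.
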
It was my understanding that the ClusterEvaluator included an attempt to
provide this measure as a per-cluster intra-cluster density, though it
looks like that output has been removed?
On 7/8/12 6:07 PM, Ted Dunning wrote:
I can't comment on the existing evaluators, but for me the only real
measure that I care about is average distance to nearest cluster for
new or held-out data. I will be building something of this sort for
the clustering part of the knn code I have been working on.
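
For reference, the measure described above could be sketched roughly as
follows in plain Python. This is only an illustration, not the knn code
mentioned: the function names are hypothetical, and cosine distance is
assumed because that is the measure used elsewhere in this thread.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, or 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_nearest_centroid_distance(held_out, centroids, distance=cosine_distance):
    """Average, over held-out points, of the distance to the closest centroid."""
    if not held_out or not centroids:
        raise ValueError("need at least one held-out point and one centroid")
    total = sum(min(distance(p, c) for c in centroids) for p in held_out)
    return total / len(held_out)


# Hypothetical toy data: two points lie along a centroid's direction,
# and one point sits midway between the two centroids.
centroids = [(1.0, 0.0), (0.0, 1.0)]
held_out = [(2.0, 0.0), (0.0, 3.0), (1.0, 1.0)]
score = mean_nearest_centroid_distance(held_out, centroids)
```

Lower scores mean the held-out points sit closer to some centroid. The
points passed in should not have been used to fit the centroids.
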
On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <[email protected]> wrote:
To use something like kmeans on any large and changing data set it
seems a requirement that there be some means of evaluating the
quality of clusters at different scales. The usual eyeballing
breaks down quickly.
Trying to use the cluster evaluators in Mahout with kmeans as the
clustering method and cosine as the distance measure has proven
problematic. The method is to iterate through the data using
different values of k, performing the evaluation at each one. What I
find is that certain values are almost always in error. The
Intra-cluster density from ClusterEvaluator is almost always NaN.
The CDbw inter-cluster density is almost always 0. I have also
seen several cases where CDbw fails to return any results but have
not tracked down why yet.
Given that the data from either evaluator is usually incomplete,
these methods are not very useful. Is Mahout dropping the
evaluators? Is the general wisdom that they are not particularly
useful? Should a newer method be pursued? This seems a fairly
important question to me. Am I missing something?
Raw data for a sample crawl is given below: