I can't comment on the existing evaluators, but the only measure I really
care about is the average distance from each new or held-out data point to
its nearest cluster.  I will be building something of this sort for the
clustering part of the knn code I have been working on.
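
As a rough illustration of that measure (a minimal sketch, not the Mahout or
knn implementation): the function names, the plain-list vector representation,
and the use of cosine distance are all assumptions made for the example. It
averages, over the held-out points, the distance from each point to its
closest centroid.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: treat a zero vector as maximally dissimilar
        # instead of dividing by zero.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance_to_nearest(centroids, held_out, distance=cosine_distance):
    """Average, over held-out points, of the distance to the closest centroid."""
    if not centroids or not held_out:
        raise ValueError("need at least one centroid and one held-out point")
    total = sum(min(distance(p, c) for c in centroids) for p in held_out)
    return total / len(held_out)


if __name__ == "__main__":
    # Hypothetical toy data: two centroids and three held-out points.
    centroids = [[1.0, 0.0], [0.0, 1.0]]
    held_out = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
    print(mean_distance_to_nearest(centroids, held_out))
```

Computing this on held-out data for each candidate k gives one number per
clustering; lower values mean the held-out points sit closer to some cluster.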

On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <[email protected]> wrote:

>  To use something like kmeans on any large and changing data set it seems
> a requirement that there be some means of evaluating the quality of
> clusters at different scales. The usual eyeballing breaks down quickly.
>
> Trying to use the cluster evaluators in Mahout with kmeans as the
> clustering method and cosine as the distance measure has proven
> problematic. The method is to iterate through the data using different ks
> and performing the evaluation at each point. What I find is that certain
> values are almost always in error. The Intra-cluster density from
> ClusterEvaluator is almost always NaN. The CDbw inter-cluster density is
> almost always 0. I have also seen several cases where CDbw fails to return
> any results but have not tracked down why yet.
>
> Given that the data for either evaluator is usually incomplete, these
> methods are not very useful. Is Mahout dropping the evaluators? Is the
> general wisdom that they are not particularly useful? Should a newer method
> be pursued? This seems a fairly important question to me. Am I missing
> something?
>
> Raw data for a sample crawl is given below:
>
>
>
>
