Can you rephrase that question? I do a rowsimilarity measure for the
docs, excluding self-similarity, but I doubt that is what you are asking.
Are you asking if I do a similarity calc on clusters? I'm planning to
find clusters that are similar using their centroids. This is to create
a sort of graph clustering model mixing different clustering scales
(different ks) but I'd like to have a way to discard poor quality
clusters from the calc.
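
A minimal sketch of the centroid comparison described above, assuming centroids are plain dense vectors and that "similar" means cosine similarity at or above a cutoff. The function names, the toy centroids, and the 0.7 threshold are illustrative assumptions, not anything taken from Mahout:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_cluster_pairs(centroids_a, centroids_b, threshold=0.8):
    """Return (i, j, similarity) for each centroid pair at or above threshold.

    centroids_a and centroids_b come from two clustering runs with
    different k; each returned tuple is one edge of the cross-scale graph.
    """
    pairs = []
    for i, ca in enumerate(centroids_a):
        for j, cb in enumerate(centroids_b):
            sim = cosine_similarity(ca, cb)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


# Hypothetical centroids from a coarse run (k=2) and a finer run (k=3).
coarse = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
fine = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = similar_cluster_pairs(coarse, fine, threshold=0.7)
for i, j, sim in edges:
    print(f"coarse {i} ~ fine {j}: {sim:.3f}")
```

Clusters judged poor by whatever quality measure ends up being used could simply be left out of the centroid lists before the pairs are computed.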
On 7/8/12 8:40 PM, Lance Norskog wrote:
Are there any measures of self-similarity?
On Sun, Jul 8, 2012 at 6:07 PM, Ted Dunning wrote:
I can't comment on the existing evaluators, but for me the only real
measure that I care about is average distance to nearest cluster for new or
held-out data. I will be building something of this sort for the
clustering part of the knn code I have been working on.
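
For concreteness, the measure described above (average distance from each held-out point to its nearest cluster) might be sketched as follows. This is an illustration only, not the knn code mentioned; it assumes clusters are represented by their centroids and uses cosine distance, and the function names and sample data are made up:

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_nearest_centroid_distance(points, centroids, distance=cosine_distance):
    """Average, over points, of the distance to the closest centroid."""
    if not points or not centroids:
        raise ValueError("need at least one point and one centroid")
    total = 0.0
    for p in points:
        total += min(distance(p, c) for c in centroids)
    return total / len(points)


# Toy centroids and held-out points (not used to fit the clusters).
centroids = [[1.0, 0.0], [0.0, 1.0]]
held_out = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
score = mean_nearest_centroid_distance(held_out, centroids)
print(f"mean distance to nearest centroid: {score:.4f}")
```

Lower values mean the held-out points sit closer to some centroid, so the number can be compared across runs on the same data.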
On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel wrote:
To use something like kmeans on any large and changing data set it seems
a requirement that there be some means of evaluating the quality of
clusters at different scales. The usual eyeballing breaks down quickly.
Trying to use the cluster evaluators in Mahout with kmeans as the
clustering method and cosine as the distance measure has proven
problematic. The method is to iterate through the data using different ks
and perform the evaluation at each one. What I find is that certain
values are almost always in error. The intra-cluster density from
ClusterEvaluator is almost always NaN. The CDbw inter-cluster density is
almost always 0. I have also seen several cases where CDbw fails to return
any results but have not tracked down why yet.
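
The sweep over k might look roughly like the sketch below. It is not Mahout code and does not call ClusterEvaluator or CDbw: the tiny k-means is a stand-in, and the score used is mean cosine distance from held-out points to their nearest centroid, swapped in as an assumption. All names, the fixed iteration count, and the data are hypothetical:

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, iterations=10, seed=0):
    """Very small k-means with cosine distance; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: cosine_distance(p, centroids[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_vector(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def sweep(train, held_out, ks, seed=0):
    """Cluster train at each k and score the result on held_out."""
    scores = {}
    for k in ks:
        centroids = kmeans(train, k, seed=seed)
        total = sum(min(cosine_distance(p, c) for c in centroids) for p in held_out)
        scores[k] = total / len(held_out)
    return scores


train = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.7, 0.7], [0.6, 0.8]]
held_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = sweep(train, held_out, ks=[1, 2, 3])
for k in sorted(scores):
    print(k, round(scores[k], 4))
```

Because the score is an average over a non-empty held-out set and zero-norm vectors are handled explicitly, this sketch cannot return NaN, which at least keeps the numbers comparable from one k to the next.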
Given that the data for either evaluator is usually incomplete, these
methods are not very useful. Is Mahout dropping the evaluators? Is the
general wisdom that they are not particularly useful? Should a newer method
be pursued? This seems a fairly important question to me. Am I missing
something?
Raw data for a sample crawl is given below: