I think that he means cluster sizes rather than term weights. For text, term frequencies follow an approximate power law.
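
One rough way to check the cluster-size reading: sort the cluster sizes, then fit a line to log(size) against log(rank). The sketch below is only an illustration. The function name `rank_size_fit` is made up here and is not part of Mahout, and a log-log least-squares fit is a quick diagnostic, not a rigorous test for a power law.

```python
import math


def rank_size_fit(sizes):
    """Least-squares fit of log(size) against log(rank).

    Returns (slope, r_squared). A roughly straight line (r_squared near 1)
    with a negative slope is consistent with power-law size scaling.
    Hypothetical helper for illustration only.
    """
    # Largest cluster gets rank 1; empty clusters are dropped because log(0) is undefined.
    ordered = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two non-empty clusters")

    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    # If every cluster has the same size the fit is a flat line; report it as exact.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared
```

Running this on the cluster sizes from each value of k would show whether the slope and the quality of the fit stay stable as k changes.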

On Mon, Jul 9, 2012 at 10:06 AM, Pat Ferrel <[email protected]> wrote:

> Sorry, I'm not following this shorthand. Are you asking if the term
> weights of each centroid follow a power law, like they are supposed to?
>
> On 7/9/12 12:34 AM, Lance Norskog wrote:
>
>> Power law size scaling.
>>
>> On Sun, Jul 8, 2012 at 11:39 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>>> What do you mean by self similarity? Power law size scaling? Or that
>>> two successive clusterings get nearly the same answer?
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 8, 2012, at 8:40 PM, Lance Norskog <[email protected]> wrote:
>>>
>>>> Are there any measures of self-similarity?
>>>>
>>>> On Sun, Jul 8, 2012 at 6:07 PM, Ted Dunning <[email protected]>
>>>> wrote:
>>>>
>>>>> I can't comment on the existing evaluators, but for me the only real
>>>>> measure that I care about is average distance to nearest cluster for
>>>>> new or held-out data. I will be building something of this sort for
>>>>> the clustering part of the knn code I have been working on.
>>>>>
>>>>> On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> To use something like kmeans on any large and changing data set it
>>>>>> seems a requirement that there be some means of evaluating the
>>>>>> quality of clusters at different scales. The usual eyeballing
>>>>>> breaks down quickly.
>>>>>>
>>>>>> Trying to use the cluster evaluators in Mahout with kmeans as the
>>>>>> clustering method and cosine as the distance measure has proven
>>>>>> problematic. The method is to iterate through the data using
>>>>>> different ks and performing the evaluation at each point. What I
>>>>>> find is that certain values are almost always in error. The
>>>>>> Intra-cluster density from ClusterEvaluator is almost always NaN.
>>>>>> The CDbw inter-cluster density is almost always 0. I have also seen
>>>>>> several cases where CDbw fails to return any results but have not
>>>>>> tracked down why yet.
>>>>>>
>>>>>> Given that the data for either evaluator is usually incomplete
>>>>>> these methods are not very useful. Is Mahout dropping the
>>>>>> evaluators? Is the general wisdom that they are not particularly
>>>>>> useful? Should a newer method be pursued? This seems a fairly
>>>>>> important question to me, am I missing something?
>>>>>>
>>>>>> Raw data for a sample crawl is given below:
>>>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> [email protected]
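
For reference, the held-out measure mentioned in the quoted thread (average distance from each new point to its nearest cluster) could be sketched as below, using cosine distance to match the setup described. This is a minimal illustration, not the Mahout evaluators and not the knn code referred to above. The function names are made up, and returning 1.0 for a zero-length vector is an assumption of this sketch, chosen so that an empty vector cannot cause a division by zero.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: a zero vector is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_nearest_centroid_distance(points, centroids, distance=cosine_distance):
    """Average, over held-out points, of the distance to the closest centroid."""
    if not points or not centroids:
        raise ValueError("need at least one point and one centroid")
    total = sum(min(distance(p, c) for c in centroids) for p in points)
    return total / len(points)
```

Computing this on held-out data for each k gives one number per clustering. Lower is better, but the value generally keeps falling as k grows, so it is the shape of the curve over k that is informative and not any single value.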
