Power-law scaling is rarely observable directly in k-means clusters, because the algorithm tends to force the clusters toward the same physical size.
Bayesian non-parametric clustering algorithms can show some scaling effects, but it is hard to get very many clusters out of them, so it is hard to demonstrate self-similar scaling over a large size range.

If you want to try, just produce a Zipf plot (cluster size versus size rank on log-log axes) and look for linearity.

On Mon, Jul 9, 2012 at 12:34 AM, Lance Norskog wrote:

> Power law size scaling.
>
> On Sun, Jul 8, 2012 at 11:39 PM, Ted Dunning wrote:
>
> > What do you mean by self similarity? Power law size scaling? Or that
> > two successive clusterings get nearly the same answer?
> >
> > On Jul 8, 2012, at 8:40 PM, Lance Norskog wrote:
> >
> >> Are there any measures of self-similarity?
> >>
> >> On Sun, Jul 8, 2012 at 6:07 PM, Ted Dunning wrote:
> >>
> >>> I can't comment on the existing evaluators, but for me the only real
> >>> measure that I care about is average distance to nearest cluster for
> >>> new or held-out data. I will be building something of this sort for
> >>> the clustering part of the knn code I have been working on.
> >>>
> >>> On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel wrote:
> >>>
> >>>> To use something like kmeans on any large and changing data set it
> >>>> seems a requirement that there be some means of evaluating the
> >>>> quality of clusters at different scales. The usual eyeballing
> >>>> breaks down quickly.
> >>>>
> >>>> Trying to use the cluster evaluators in Mahout with kmeans as the
> >>>> clustering method and cosine as the distance measure has proven
> >>>> problematic. The method is to iterate through the data using
> >>>> different ks and perform the evaluation at each point. What I find
> >>>> is that certain values are almost always in error. The
> >>>> intra-cluster density from ClusterEvaluator is almost always NaN.
> >>>> The CDbw inter-cluster density is almost always 0. I have also seen
> >>>> several cases where CDbw fails to return any results but have not
> >>>> tracked down why yet.
> >>>>
> >>>> Given that the data for either evaluator is usually incomplete,
> >>>> these methods are not very useful. Is Mahout dropping the
> >>>> evaluators? Is the general wisdom that they are not particularly
> >>>> useful? Should a newer method be pursued? This seems a fairly
> >>>> important question to me; am I missing something?
> >>>>
> >>>> Raw data for a sample crawl is given below:
> >>>>
> >>
> >> --
> >> Lance Norskog
>
> --
> Lance Norskog
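
P.S. The Zipf-plot check suggested at the top of this message could be sketched roughly as follows. This is only an illustration, not anything from Mahout: the function names `zipf_points` and `fit_line` and the example sizes are made up, and a least-squares fit in log-log space is a crude numeric stand-in for plotting the points and eyeballing linearity.

```python
import math


def zipf_points(cluster_sizes):
    """Return (log10 rank, log10 size) pairs, largest cluster first.

    Empty clusters are dropped because log10(0) is undefined.
    """
    sizes = sorted((s for s in cluster_sizes if s > 0), reverse=True)
    return [(math.log10(rank), math.log10(size))
            for rank, size in enumerate(sizes, start=1)]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two non-empty clusters")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # If every cluster has the same size, the points lie on a flat line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Hypothetical cluster sizes, invented for illustration: size ~ 1000 / rank.
example_sizes = [round(1000 / r) for r in range(1, 51)]
slope, intercept, r_squared = fit_line(zipf_points(example_sizes))
print(f"slope={slope:.2f}  R^2={r_squared:.3f}")
```

Points that fall close to a straight line (R^2 near 1) across a wide range of ranks would be consistent with power-law size scaling. With only a handful of clusters, a good fit says very little, which is the difficulty mentioned above.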
