Power-law size scaling is rarely observed directly in k-means clusters
because the algorithm tends to force the clusters to be roughly the same
physical size.

Bayesian non-parametric clustering algorithms can show some scaling
effects, but it is difficult to get a large number of clusters, so it is
hard to demonstrate self-similar scaling over a wide range of sizes.

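If you want to try, just produce a Zipf plot (sort the clusters by size and
plot cluster size against size rank on log-log axes).  Look for linearity:
a roughly straight line is what power-law scaling would look like.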
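
As a rough illustration of the Zipf-plot check, here is a minimal Python
sketch. It assumes you already have a list of cluster sizes (points per
cluster) from whatever clusterer you ran; the function name zipf_fit and the
synthetic sizes are made up for this example and are not part of Mahout. It
fits a least-squares line to (log rank, log size) and reports the slope and
R^2 so that "look for linearity" has a number attached.

```python
import math


def zipf_fit(cluster_sizes):
    """Fit a line to (log rank, log size); return (slope, r_squared).

    Hypothetical helper for illustration only. Empty clusters are dropped
    because log(0) is undefined.
    """
    sizes = sorted((s for s in cluster_sizes if s > 0), reverse=True)
    if len(sizes) < 3:
        raise ValueError("need at least 3 non-empty clusters")

    # Rank 1 is the largest cluster.
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    # If every cluster has the same size the points lie on a flat line.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic sizes that follow size ~ 1 / rank, i.e. an ideal Zipf law.
synthetic_sizes = [1000.0 / rank for rank in range(1, 51)]
slope, r_squared = zipf_fit(synthetic_sizes)
print("slope = %.3f, R^2 = %.3f" % (slope, r_squared))
```

A slope near zero means the clusters are all about the same size, which is
what you would expect from k-means. Treat a high R^2 as a hint rather than
proof: with only a few dozen clusters, many non-power-law distributions also
look fairly straight on log-log axes.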

On Mon, Jul 9, 2012 at 12:34 AM, Lance Norskog <[email protected]> wrote:

> Power law size scaling.
>
> On Sun, Jul 8, 2012 at 11:39 PM, Ted Dunning <[email protected]>
> wrote:
> > What do you mean by self similarity?  Power law size scaling?  Or that
> two successive clusterings get nearly the same answer?
> >
> > Sent from my iPhone
> >
> > On Jul 8, 2012, at 8:40 PM, Lance Norskog <[email protected]> wrote:
> >
> >> Are there any measures of self-similarity?
> >>
> >> On Sun, Jul 8, 2012 at 6:07 PM, Ted Dunning <[email protected]>
> wrote:
> >>
> >>> I can't comment on the existing evaluators, but for me the only real
> >>> measure that I care about is average distance to nearest cluster for
> new or
> >>> held-out data.  I will be building something of this sort for the
> >>> clustering part of the knn code I have been working on.
> >>>
> >>>
> >>> On Sun, Jul 8, 2012 at 5:44 PM, Pat Ferrel <[email protected]>
> wrote:
> >>>
> >>>> To use something like kmeans on any large and changing data set it
> seems
> >>>> a requirement that there be some means of evaluating the quality of
> >>>> clusters at different scales. The usual eyeballing breaks down
> quickly.
> >>>>
> >>>> Trying to use the cluster evaluators in Mahout with kmeans as the
> >>>> clustering method and cosine as the distance measure has proven
> >>>> problematic. The method is to iterate through the data using
> different ks
> >>>> and performing the evaluation at each point. What I find is that
> certain
> >>>> values are almost always in error. The Intra-cluster density from
> >>>> ClusterEvaluator is almost always NaN. The CDbw inter-cluster
> density is
> >>>> almost always 0. I have also seen several cases where CDbw fails to
> return
> >>>> any results but have not tracked down why yet.
> >>>>
> >>>> Given that the data for either evaluator is usually incomplete these
> >>>> methods are not very useful. Is mahout dropping the evaluators? Is the
> >>>> general wisdom that they are not particularly useful? Should a newer
> method
> >>>> be pursued? This seems a fairly important question to me, am I missing
> >>>> something?
> >>>>
> >>>> Raw data for a sample crawl is given below:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
>
>
>
> --
> Lance Norskog
> [email protected]
>
