If you plot these you will see an exponential distribution of cluster size
that fits

exp(13.23 + -0.09483*x$cluster)

It is mildly interesting that this isn't a power law, but the take-away is
the same: the second and later passes are going to have a problem with key
skew.
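
For anyone who wants to check the fit without plotting, here is a rough Python sketch. It assumes x$cluster is simply the numeric cluster ID and uses ordinary least squares on log(count), which is only one way to get such a fit; the coefficients need not match the ones above exactly. The counts are copied from the quoted message below, and the names (counts, fit_log_linear) are made up for illustration.

```python
import math

# Vectors per cluster after iteration 1, indexed by cluster ID (C-0 .. C-59),
# copied from the quoted message.
counts = [
    621236, 502712, 396496, 369936, 361496, 397233, 290305, 277959, 248298, 277152,
    192341, 180626, 194878, 149143, 136651, 125184, 107250, 115815, 106541, 92748,
    72520, 80788, 64589, 66936, 68924, 60714, 59370, 47513, 34267, 33357,
    32002, 30125, 28909, 24937, 16684, 22988, 23991, 15835, 17363, 11476,
    13528, 9611, 11118, 8736, 10630, 8371, 8707, 7570, 4979, 5138,
    4378, 4001, 3727, 4288, 3146, 2730, 2528, 2098, 2401, 1964,
]


def fit_log_linear(values):
    """Least-squares fit of log(values[i]) = intercept + slope * i."""
    xs = list(range(len(values)))
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_log_linear(counts)

# Simple skew measures: how much of the data sits in the biggest clusters.
total = sum(counts)
largest_share = max(counts) / total
top10_share = sum(sorted(counts, reverse=True)[:10]) / total

print(f"log(count) ~ {intercept:.2f} + {slope:.5f} * cluster")
print(f"largest cluster share: {largest_share:.1%}")
print(f"top-10 cluster share:  {top10_share:.1%}")
```

If each cluster ID ends up as a reduce key (an assumption about how the job is keyed), the two share figures give a quick feel for how unevenly the work will be spread in later passes.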

On Fri, Feb 25, 2011 at 9:40 AM, Timothy Potter <[email protected]> wrote:

> Quick update -- making some progress with this by increasing -a0 to 10
> instead of 1 ... The first iteration completed successfully in 1 hr 8 mins.
>
> I had 72 map tasks and 12 reducers; the reducers completed at roughly the
> same time.
>
> However, I'm not out of the woods yet as the map tasks seem pretty bogged
> down in Iteration 2. The vector counts per cluster from Iteration 1 are
> included below.
>
> I also want to try the L1Model as suggested by Jeff.
>
> Any tips on where I can learn more about why raising -a0 to 10 caused the
> input vectors to be more evenly distributed over the initial prior
> clusters?
>
> Thanks for your help.
>
> Distribution of Vectors per cluster after 1 Dirichlet Iteration:
>
>   ID Num Vecs
>   :C-0: 621236   :C-1: 502712   :C-5: 397233   :C-2: 396496
>   :C-3: 369936   :C-4: 361496   :C-6: 290305   :C-7: 277959
>   :C-9: 277152   :C-8: 248298   :C-12: 194878  :C-10: 192341
>   :C-11: 180626  :C-13: 149143  :C-14: 136651  :C-15: 125184
>   :C-17: 115815  :C-16: 107250  :C-18: 106541  :C-19: 92748
>   :C-21: 80788   :C-20: 72520   :C-24: 68924   :C-23: 66936
>   :C-22: 64589   :C-25: 60714   :C-26: 59370   :C-27: 47513
>   :C-28: 34267   :C-29: 33357   :C-30: 32002   :C-31: 30125
>   :C-32: 28909   :C-33: 24937   :C-36: 23991   :C-35: 22988
>   :C-38: 17363   :C-34: 16684   :C-37: 15835   :C-40: 13528
>   :C-39: 11476   :C-42: 11118   :C-44: 10630   :C-41: 9611
>   :C-43: 8736    :C-46: 8707    :C-45: 8371    :C-47: 7570
>   :C-49: 5138    :C-48: 4979    :C-50: 4378    :C-53: 4288
>   :C-51: 4001    :C-52: 3727    :C-54: 3146    :C-55: 2730
>   :C-56: 2528    :C-58: 2401    :C-57: 2098    :C-59: 1964
>
> On Thu, Feb 24, 2011 at 5:02 PM, Jeff Eastman <[email protected]> wrote:
>
> > It indicates the prior cluster centers (as initialized by the
> > ModelDistribution) and std are waaaay off target.
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[email protected]]
> > Sent: Thursday, February 24, 2011 3:47 PM
> > To: [email protected]
> > Cc: Timothy Potter
> > Subject: Re: Dirichlet clustering woes ...
> >
> > This sounds like a classic case of a monster cluster.
> >
> > On Thu, Feb 24, 2011 at 3:31 PM, Timothy Potter <[email protected]> wrote:
> >
> > > Intuitively, your comment about all points being assigned to one cluster
> > > makes sense because we get through the map tasks and all the reducers
> > > except one very quickly ... and then it bogs down.
> > >
> >
>
