I'm not sure of the subtleties of the Dirichlet distribution, but rDirichlet in
UncommonDistributions adds the alpha0 value to the total counts when it samples
from the Beta distribution. In the first iteration, when the total counts are
zero, this increases the probability of choosing a new cluster. During
subsequent iterations, alpha0 is completely overshadowed by the total counts.
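
To make that concrete, here is a rough sketch of the effect. It is not the
Mahout code: it assumes a stick-breaking scheme in which cluster i's share of
the remaining probability is drawn from Beta(count_i + 1, remaining_counts +
alpha0), and it uses the mean of each Beta, a / (a + b), instead of a random
draw so the numbers are repeatable. The function name and the "+ 1" are my
assumptions for illustration.

```python
def expected_weights(counts, alpha0):
    """Expected mixture weights under an assumed stick-breaking scheme.

    Assumption (not taken from Mahout): the fraction for cluster i is
    Beta(counts[i] + 1, remaining_counts + alpha0), replaced here by its
    mean a / (a + b).
    """
    remaining_counts = float(sum(counts))
    remainder = 1.0  # probability mass not yet handed out
    weights = []
    for c in counts:
        remaining_counts -= c
        a = c + 1.0
        b = max(0.0, remaining_counts) + alpha0
        p = (a / (a + b)) * remainder
        weights.append(p)
        remainder -= p
    return weights  # leftover mass stays in `remainder`, so the sum is < 1


# First iteration: all counts are zero, so alpha0 dominates.
print(expected_weights([0, 0, 0, 0], 1.0))   # [0.5, 0.25, 0.125, 0.0625]
print(expected_weights([0, 0, 0, 0], 10.0))  # much flatter across clusters

# Later iterations: large counts swamp alpha0.
print(expected_weights([1000, 1000], 1.0))
print(expected_weights([1000, 1000], 10.0))
```

With zero counts and alpha0 = 1, the first cluster expects half of the mass
and each later cluster half of what is left; with alpha0 = 10 each cluster
takes only about 1/11 of the remainder, so the weights are far more even.
Once the counts are in the thousands, changing alpha0 from 1 to 10 barely
moves the result.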

-----Original Message-----
From: Timothy Potter [mailto:[email protected]] 
Sent: Friday, February 25, 2011 9:41 AM
To: [email protected]
Subject: Re: Dirichlet clustering woes ...

Quick update -- making some progress with this by increasing -a0 to 10
instead of 1 ... The first iteration completed successfully in 1 hr 8 mins.

I had 72 map tasks and 12 reducers; the reducers completed at roughly the
same time.

However, I'm not out of the woods yet as the map tasks seem pretty bogged
down in Iteration 2. The number of vectors per cluster from Iteration 1 is
included below.

I also want to try the L1Model as suggested by Jeff.

Any tips on where I can learn more about why raising -a0 to 10 caused the
input vectors to be more evenly distributed over the initial prior clusters?

Thanks for your help.

Distribution of Vectors per cluster after 1 Dirichlet Iteration:

   ID: Num Vecs

   C-0:  621236   C-1:  502712   C-5:  397233   C-2:  396496   C-3:  369936
   C-4:  361496   C-6:  290305   C-7:  277959   C-9:  277152   C-8:  248298
   C-12: 194878   C-10: 192341   C-11: 180626   C-13: 149143   C-14: 136651
   C-15: 125184   C-17: 115815   C-16: 107250   C-18: 106541   C-19:  92748
   C-21:  80788   C-20:  72520   C-24:  68924   C-23:  66936   C-22:  64589
   C-25:  60714   C-26:  59370   C-27:  47513   C-28:  34267   C-29:  33357
   C-30:  32002   C-31:  30125   C-32:  28909   C-33:  24937   C-36:  23991
   C-35:  22988   C-38:  17363   C-34:  16684   C-37:  15835   C-40:  13528
   C-39:  11476   C-42:  11118   C-44:  10630   C-41:   9611   C-43:   8736
   C-46:   8707   C-45:   8371   C-47:   7570   C-49:   5138   C-48:   4979
   C-50:   4378   C-53:   4288   C-51:   4001   C-52:   3727   C-54:   3146
   C-55:   2730   C-56:   2528   C-58:   2401   C-57:   2098   C-59:   1964

On Thu, Feb 24, 2011 at 5:02 PM, Jeff Eastman <[email protected]> wrote:

> It indicates the prior cluster centers (as initialized by the
> ModelDistribution) and std are waaaay off target.
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Thursday, February 24, 2011 3:47 PM
> To: [email protected]
> Cc: Timothy Potter
> Subject: Re: Dirichlet clustering woes ...
>
> This sounds like a classic case of a monster cluster.
>
> On Thu, Feb 24, 2011 at 3:31 PM, Timothy Potter <[email protected]> wrote:
>
> > Intuitively, your comment about all points being assigned to one cluster
> > makes sense because we get through the map tasks and all the reducers
> > except one very quickly ... and then it bogs down.
> >
>
