I'm not sure of the subtleties of the Dirichlet distribution, but rDirichlet in UncommonDistributions adds the alpha0 value to the total counts when it samples from the Beta distribution. In the first iteration, when the total counts are zero, this increases the probability of choosing a new cluster. In subsequent iterations, alpha0 is completely overshadowed by the total counts.
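
A rough illustration of that point, not Mahout's actual code: it assumes, as a simplification, that each cluster's sampling parameter is its count plus alpha0. The helper `prior_share` is hypothetical, and the counts are the largest and smallest cluster sizes from the message quoted below, with -a0 set to 10.

```python
def prior_share(count, alpha0):
    """Fraction of the parameter (count + alpha0) that comes from alpha0.

    Assumption: alpha0 is simply added to the cluster's count; this is a
    simplification for illustration, not a copy of rDirichlet.
    """
    return alpha0 / (count + alpha0)


ALPHA0 = 10.0  # the -a0 value used in the quoted message

# Iteration 1: no vectors have been assigned yet, so alpha0 is the whole parameter.
first = prior_share(0, ALPHA0)

# Iteration 2: counts after iteration 1 for the largest and smallest clusters.
largest = prior_share(621236, ALPHA0)
smallest = prior_share(1964, ALPHA0)

print(f"iteration 1: {first:.6f}")
print(f"iteration 2, largest cluster: {largest:.6f}")
print(f"iteration 2, smallest cluster: {smallest:.6f}")
```

Under that assumption, alpha0 is the entire parameter before any counts exist, and well under one percent of it for every cluster once the first iteration's counts are in.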

-----Original Message-----
From: Timothy Potter [mailto:[email protected]]
Sent: Friday, February 25, 2011 9:41 AM
To: [email protected]
Subject: Re: Dirichlet clustering woes ...

Quick update -- making some progress with this by increasing -a0 to 10 instead of 1 ...

The first iteration completed successfully in 1 hr 8 mins. I had 72 map tasks and 12 reducers; the reducers completed at roughly the same time. However, I'm not out of the woods yet, as the map tasks seem pretty bogged down in Iteration 2. The number of vectors per cluster from Iteration 1 is included below. I also want to try the L1Model as suggested by Jeff.

Any tips on where I can learn more about why raising -a0 to 10 caused the input vectors to be more evenly distributed over the initial prior clusters?

Thanks for your help.

Distribution of Vectors per cluster after 1 Dirichlet Iteration:

ID      Num Vecs
:C-0:   621236
:C-1:   502712
:C-5:   397233
:C-2:   396496
:C-3:   369936
:C-4:   361496
:C-6:   290305
:C-7:   277959
:C-9:   277152
:C-8:   248298
:C-12:  194878
:C-10:  192341
:C-11:  180626
:C-13:  149143
:C-14:  136651
:C-15:  125184
:C-17:  115815
:C-16:  107250
:C-18:  106541
:C-19:  92748
:C-21:  80788
:C-20:  72520
:C-24:  68924
:C-23:  66936
:C-22:  64589
:C-25:  60714
:C-26:  59370
:C-27:  47513
:C-28:  34267
:C-29:  33357
:C-30:  32002
:C-31:  30125
:C-32:  28909
:C-33:  24937
:C-36:  23991
:C-35:  22988
:C-38:  17363
:C-34:  16684
:C-37:  15835
:C-40:  13528
:C-39:  11476
:C-42:  11118
:C-44:  10630
:C-41:  9611
:C-43:  8736
:C-46:  8707
:C-45:  8371
:C-47:  7570
:C-49:  5138
:C-48:  4979
:C-50:  4378
:C-53:  4288
:C-51:  4001
:C-52:  3727
:C-54:  3146
:C-55:  2730
:C-56:  2528
:C-58:  2401
:C-57:  2098
:C-59:  1964

On Thu, Feb 24, 2011 at 5:02 PM, Jeff Eastman <[email protected]> wrote:

> It indicates the prior cluster centers (as initialized by the
> ModelDistribution) and std are waaaay off target.
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Thursday, February 24, 2011 3:47 PM
> To: [email protected]
> Cc: Timothy Potter
> Subject: Re: Dirichlet clustering woes ...
>
> This sounds like a classic case of a monster cluster.
>
> On Thu, Feb 24, 2011 at 3:31 PM, Timothy Potter <[email protected]> wrote:
>
> > Intuitively, your comment about all points being assigned to one cluster
> > makes sense because we get through the map tasks and all the reducers
> > except one very quickly ... and then it bogs down.
> >
