If you plot these you will see an exponential distribution of cluster size that fits

exp(13.23 + -0.09483*x$cluster)

It is mildly interesting that this isn't a power law, but you have the same take-away. The second pass and later passes are going to have a problem with key skew.

On Fri, Feb 25, 2011 at 9:40 AM, Timothy Potter <[email protected]> wrote:

> Quick update -- making some progress with this by increasing -a0 to 10
> instead of 1 ... The first iteration completed successfully in 1 hr 8 mins.
>
> I had 72 map tasks and 12 reducers; the reducers completed at roughly the
> same time.
>
> However, I'm not out of the woods yet as the map tasks seem pretty bogged
> down in Iteration 2. The number of vectors per cluster from Iteration 1 are
> included below.
>
> I also want to try the L1Model as suggested by Jeff.
>
> Any tips on where I can learn more about why raising -a0 to 10 caused the
> input vectors to be more evenly distributed over the initial prior
> clusters?
>
> Thanks for your help.
>
> Distribution of Vectors per cluster after 1 Dirichlet Iteration:
>
> ID Num Vecs
> :C-0:  621236   :C-1:  502712   :C-5:  397233   :C-2:  396496
> :C-3:  369936   :C-4:  361496   :C-6:  290305   :C-7:  277959
> :C-9:  277152   :C-8:  248298   :C-12: 194878   :C-10: 192341
> :C-11: 180626   :C-13: 149143   :C-14: 136651   :C-15: 125184
> :C-17: 115815   :C-16: 107250   :C-18: 106541   :C-19: 92748
> :C-21: 80788    :C-20: 72520    :C-24: 68924    :C-23: 66936
> :C-22: 64589    :C-25: 60714    :C-26: 59370    :C-27: 47513
> :C-28: 34267    :C-29: 33357    :C-30: 32002    :C-31: 30125
> :C-32: 28909    :C-33: 24937    :C-36: 23991    :C-35: 22988
> :C-38: 17363    :C-34: 16684    :C-37: 15835    :C-40: 13528
> :C-39: 11476    :C-42: 11118    :C-44: 10630    :C-41: 9611
> :C-43: 8736     :C-46: 8707     :C-45: 8371     :C-47: 7570
> :C-49: 5138     :C-48: 4979     :C-50: 4378     :C-53: 4288
> :C-51: 4001     :C-52: 3727     :C-54: 3146     :C-55: 2730
> :C-56: 2528     :C-58: 2401     :C-57: 2098     :C-59: 1964
>
> On Thu, Feb 24, 2011 at 5:02 PM, Jeff Eastman <[email protected]> wrote:
>
> > It indicates the prior cluster centers (as initialized by the
> > ModelDistribution) and std are waaaay off target.
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[email protected]]
> > Sent: Thursday, February 24, 2011 3:47 PM
> > To: [email protected]
> > Cc: Timothy Potter
> > Subject: Re: Dirichlet clustering woes ...
> >
> > This sounds like a classic case of a monster cluster.
> >
> > On Thu, Feb 24, 2011 at 3:31 PM, Timothy Potter <[email protected]> wrote:
> >
> > > Intuitively, your comment about all points being assigned to one cluster
> > > makes sense because we get through the map tasks and all the reducers
> > > except one very quickly ... and then it bogs down.
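
For anyone who wants to check the exponential fit mentioned at the top of this message, here is a minimal sketch in plain Python. It makes two assumptions that the thread does not confirm: that x$cluster is the numeric cluster id (0 to 59), and that the fit is an ordinary least-squares line through log(count). The function name fit_log_linear is made up for illustration, and the coefficients it prints may differ somewhat from the ones quoted above if a different fitting method was used. The counts are the Iteration 1 numbers from the quoted listing.

```python
import math

# (cluster id, vector count) pairs copied from the Iteration 1 listing in this thread.
COUNTS = [
    (0, 621236), (1, 502712), (5, 397233), (2, 396496),
    (3, 369936), (4, 361496), (6, 290305), (7, 277959),
    (9, 277152), (8, 248298), (12, 194878), (10, 192341),
    (11, 180626), (13, 149143), (14, 136651), (15, 125184),
    (17, 115815), (16, 107250), (18, 106541), (19, 92748),
    (21, 80788), (20, 72520), (24, 68924), (23, 66936),
    (22, 64589), (25, 60714), (26, 59370), (27, 47513),
    (28, 34267), (29, 33357), (30, 32002), (31, 30125),
    (32, 28909), (33, 24937), (36, 23991), (35, 22988),
    (38, 17363), (34, 16684), (37, 15835), (40, 13528),
    (39, 11476), (42, 11118), (44, 10630), (41, 9611),
    (43, 8736), (46, 8707), (45, 8371), (47, 7570),
    (49, 5138), (48, 4979), (50, 4378), (53, 4288),
    (51, 4001), (52, 3727), (54, 3146), (55, 2730),
    (56, 2528), (58, 2401), (57, 2098), (59, 1964),
]


def fit_log_linear(pairs):
    """Least-squares fit of log(count) = intercept + slope * cluster_id.

    Hypothetical helper: an assumption about how the fit could be done,
    not necessarily the method used in the original reply.
    """
    n = len(pairs)
    xs = [float(cluster_id) for cluster_id, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_log_linear(COUNTS)
total = sum(count for _, count in COUNTS)
largest = max(count for _, count in COUNTS)

# A straight line in log space means the sizes decay exponentially with cluster id.
print(f"log(count) ~ {intercept:.2f} + {slope:.5f} * cluster")
# Share of all vectors in the biggest cluster: a rough indicator of key skew.
print(f"largest cluster holds {largest / total:.1%} of {total} vectors")
```

The second print is only a rough skew indicator: if clusters are used as keys, the reducer that receives the largest one gets that share of the input regardless of how many reducers are configured.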
