I'm surprised too. It looks like you are creating 60 clusters, which is completely reasonable. During map processing, each point is compared to each cluster to generate its pdf(), and the point is assigned to one of the clusters by sampling from a multinomial over all the pdfs. If many points are assigned to a single cluster by this process, then the copy-merge step could take a while to build the reducer input for that cluster. How many mappers are being created from your dataset?
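To make the map step concrete before moving on to the reduce side, here is a minimal standalone sketch of that assignment. The Model interface and pdf() names below are stand-ins for illustration, not the actual Mahout classes:

import java.util.Random;

public class MultinomialAssignSketch {

  // Stand-in for a cluster model (not the real Mahout model classes).
  interface Model {
    double pdf(double[] x); // likelihood of x under this cluster
  }

  // Evaluate pdf(x) against every cluster, then draw the assignment
  // from a multinomial over those values.
  static int assign(double[] x, Model[] clusters, Random rng) {
    double[] p = new double[clusters.length];
    double total = 0.0;
    for (int i = 0; i < clusters.length; i++) {
      p[i] = clusters[i].pdf(x); // one pdf evaluation per cluster, per point
      total += p[i];
    }
    double r = rng.nextDouble() * total; // multinomial draw
    for (int i = 0; i < p.length; i++) {
      r -= p[i];
      if (r <= 0.0) {
        return i;
      }
    }
    return p.length - 1; // guard against floating-point rounding
  }
}

So the map cost is the same O(points * k) pdf evaluations regardless of skew; the skew only bites downstream, once one cluster soaks up most of the draws.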
The reducers then accumulate the posterior statistics for one or more clusters. You can try increasing the number of reducers (up to k), which can help with this step. Again, if most of your points are being assigned to a single cluster, that reducer will be bogged down observing them all.

Also, since the models accumulate Gaussian statistics to compute mean and std posterior values, these values will tend to become denser as many vectors are summed, and this can drive up memory consumption during the reduce step (the first sketch after the quoted thread below illustrates this). You might try increasing the value of -k to spread the vectors over more clusters. Adjusting the value of -a0 could also cause input vectors to be more evenly distributed over the initial prior clusters, which have random center vectors (see the second sketch below). For text, you might find that the L1Model with a CosineDistanceMeasure could work better than the default NormalModelDistribution.

You are breaking new ground here. I've run Dirichlet over Reuters and it seemed to work OK at that scale.

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Thursday, February 24, 2011 2:26 PM
To: [email protected]
Cc: Timothy Potter
Subject: Re: Dirichlet clustering woes ...

Do you have any stats about how many clusters there are and whether a vast number of points are being assigned to a single cluster?

I am a little surprised at your results since the Dirichlet clustering doesn't have any tall poles (that I know of). Every point is compared to every cluster and contributes to every cluster. As such, stragglers shouldn't be a big deal.

Did you check the usual suspects with respect to swapping and GC?

On Thu, Feb 24, 2011 at 2:18 PM, Timothy Potter <[email protected]> wrote:

> My colleague Szymon and I have been working on Mahout-588 and hoped to
> include Dirichlet in our clustering benchmarks, but unfortunately have not
> had much success. So we're reaching out to the community to see if anyone
> else has been successful with somewhat large-scale Dirichlet clustering.
>
> Specifically, we have 6,077,604 sparse TFIDF vectors generated from the
> Apache Mail Archives.
>
> Using vectors with 40K dimensions on a 5-node cluster, it runs nicely
> until map 100% and reduce 92%, and then it virtually stops: it takes
> 3 min to reach 93%, 7 min to reach 94%, 23 min to reach 95%, 1:12 to
> reach 96%, and after another 4 h, nothing. The CPUs at the nodes run at
> almost 100% and use the full 6 GB.
>
> So then we tried vectors with 20K dimensions and were able to complete
> 1 iteration after 7 hrs 32 mins. The last 3% of reduce took 1 h per
> percent; I had 4 working nodes (+1 namenode), Xmx2500, and the max
> number of reducers set to 1.
>
> The job args we're using are:
>
> bin/mahout dirichlet \
>   -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
>   -o /asf-mail-archives/mahout-0.4/dirichlet/ \
>   -a0 1.0 \
>   -x 10 \
>   --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -k 60
>
> We're still studying the code to diagnose ourselves, but also wanted to
> get some feedback.
>
> Kind regards,
>
> Timothy Potter
> [email protected]
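First sketch: the reduce-side memory growth. The posterior statistics are just a count plus two running sums, so any dimension that is non-zero in any observed vector becomes non-zero in the accumulators; summing millions of sparse TFIDF vectors into 20K-40K dimension arrays densifies them quickly. This is a minimal sketch using plain double[] instead of Mahout's Vector types, and the names are illustrative, not the actual Mahout classes:

public class GaussianStatsSketch {
  int s0;        // number of points observed
  double[] s1;   // running sum of the vectors
  double[] s2;   // running sum of the squared components

  GaussianStatsSketch(int dims) {
    s1 = new double[dims];
    s2 = new double[dims];
  }

  // Each observation unions its non-zero dimensions into s1/s2,
  // so the accumulators densify as more points are summed in.
  void observe(double[] x) {
    s0++;
    for (int i = 0; i < x.length; i++) {
      s1[i] += x[i];
      s2[i] += x[i] * x[i];
    }
  }

  // Posterior mean and std per dimension, recovered from s0/s1/s2.
  double mean(int i) {
    return s1[i] / s0;
  }

  double std(int i) {
    double m = mean(i);
    return Math.sqrt(s2[i] / s0 - m * m);
  }
}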

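Second sketch: the -a0 knob. Assuming, as a simplification of what the driver actually does, that the initial mixture weights are a draw from a symmetric Dirichlet with parameter a0, such a draw can be generated by normalizing independent Gamma(a0) samples. Larger a0 pushes the sampled weights toward uniform 1/k, so input vectors get spread more evenly over the prior clusters on the first pass. Again a generic illustration, not Mahout code:

import java.util.Random;

public class DirichletWeightsSketch {

  // Sample mixture weights from a symmetric Dirichlet(a0) by
  // normalizing k independent Gamma(a0, 1) draws. Larger a0 means
  // weights closer to uniform 1/k; small a0 means a few big weights.
  static double[] sampleWeights(int k, double a0, Random rng) {
    double[] w = new double[k];
    double sum = 0.0;
    for (int i = 0; i < k; i++) {
      w[i] = sampleGamma(a0, rng);
      sum += w[i];
    }
    for (int i = 0; i < k; i++) {
      w[i] /= sum;
    }
    return w;
  }

  // Marsaglia-Tsang Gamma(shape, 1) sampler, with the usual
  // boost for shape < 1.
  static double sampleGamma(double shape, Random rng) {
    if (shape < 1.0) {
      return sampleGamma(shape + 1.0, rng)
          * Math.pow(rng.nextDouble(), 1.0 / shape);
    }
    double d = shape - 1.0 / 3.0;
    double c = 1.0 / Math.sqrt(9.0 * d);
    while (true) {
      double x = rng.nextGaussian();
      double v = 1.0 + c * x;
      if (v <= 0.0) {
        continue;
      }
      v = v * v * v;
      double u = rng.nextDouble();
      if (Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
        return d * v;
      }
    }
  }

  public static void main(String[] args) {
    Random rng = new Random(42);
    // With a0 = 1.0 and k = 60 the initial weights come out noticeably
    // uneven; raise a0 and the draws flatten toward 1/60.
    for (double w : sampleWeights(60, 1.0, rng)) {
      System.out.printf("%.4f ", w);
    }
  }
}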