Do you have any stats about how many clusters there are and whether a vast number of points are being assigned to a single cluster?
I am a little surprised at your results since the Dirichlet clustering doesn't have any tall poles (that I know of). Every point is compared to every cluster and contributes to every cluster. As such, stragglers shouldn't be a big deal. Did you check the usual suspects with respect to swapping and GC? On Thu, Feb 24, 2011 at 2:18 PM, Timothy Potter <[email protected]>wrote: > My colleague Szymon and I have been working on Mahout-588 and hoped to > include Dirichlet in our clustering benchmarks, but unfortunately have not > had much success. So we're reaching out to the community to see if anyone > else has been successful with somewhat large-scale Dirichlet clustering. > > Specifically, we have 6,077,604 sparse TFIDF vectors generated from the > Apache Mail Archives. > > Using vectors with 40K dimensions on a 5-node cluster it runs nicely until > map-100% and reduce-92%. and than it virtually stops. it takes 3min to 93%, > 7min to get 94%, 23min to get 95%, 1:12 to 96% and after another 4h > nothing. > The CPUs at the nodes run with almost 100% and full 6GB. > > So then we tried vectors with 20K dimensions and were able to complete 1 > iteration after 7 hrs 32 mins. The last 3% of reduce was running 1h each > percent, i had 4 working nodes (+1 namenode), Xmx2500 and max num of > reducers set to 1. > > The job args we're using are: > > bin/mahout dirichlet \ > -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \ > -o /asf-mail-archives/mahout-0.4/dirichlet/ \ > -a0 1.0 \ > -x 10 \ > --distanceMeasure > org.apache.mahout.common.distance.CosineDistanceMeasure \ > -k 60 > > > We're still studying the code to diagnose ourselves, but also wanted to get > some feedback. > > Kind regards, > > Timothy Potter > [email protected] >
