Thanks Ted, good to know about there not being any "tall poles". I'll need to dig into it a bit more to answer your first question, but at least that gives me something to look for.
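
A rough sketch of how the stats in the first question might be gathered, assuming the per-point cluster assignments can be dumped to a plain sequence of cluster ids. That dump step and the function name `cluster_size_stats` are assumptions for illustration, not part of Mahout:

```python
from collections import Counter


def cluster_size_stats(assignments):
    """Summarize how points are spread across clusters.

    `assignments` is a hypothetical input: one cluster id per point,
    e.g. parsed from whatever dump of the clustering output is available.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {
            "num_clusters": 0,
            "total_points": 0,
            "largest_cluster": None,
            "largest_share": 0.0,
        }
    # most_common(1) returns the (cluster id, count) pair with the highest count
    largest_id, largest_count = counts.most_common(1)[0]
    return {
        "num_clusters": len(counts),
        "total_points": total,
        "largest_cluster": largest_id,
        # A share close to 1.0 would mean nearly all points sit in one cluster
        "largest_share": largest_count / total,
    }


if __name__ == "__main__":
    # Toy data only; real input would come from the clustering output.
    print(cluster_size_stats([0, 0, 0, 1, 2]))
```

A `largest_share` near 1.0 would point at the "vast number of points in a single cluster" case; an even spread would rule it out.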

On Thu, Feb 24, 2011 at 3:25 PM, Ted Dunning <[email protected]> wrote:

> Do you have any stats about how many clusters there are and whether a vast
> number of points are being assigned to a single cluster?
>
> I am a little surprised at your results since the Dirichlet clustering
> doesn't have any tall poles (that I know of). Every point is compared to
> every cluster and contributes to every cluster. As such, stragglers
> shouldn't be a big deal.
>
> Did you check the usual suspects with respect to swapping and GC?
>
>
> On Thu, Feb 24, 2011 at 2:18 PM, Timothy Potter <[email protected]> wrote:
>
>> My colleague Szymon and I have been working on Mahout-588 and hoped to
>> include Dirichlet in our clustering benchmarks, but unfortunately have not
>> had much success. So we're reaching out to the community to see if anyone
>> else has been successful with somewhat large-scale Dirichlet clustering.
>>
>> Specifically, we have 6,077,604 sparse TFIDF vectors generated from the
>> Apache Mail Archives.
>>
>> Using vectors with 40K dimensions on a 5-node cluster, it runs nicely until
>> map-100% and reduce-92%, and then it virtually stops. It takes 3min to get
>> to 93%, 7min to get to 94%, 23min to get to 95%, 1:12 to get to 96%, and
>> after another 4h nothing. The CPUs on the nodes run at almost 100% and the
>> full 6GB.
>>
>> So then we tried vectors with 20K dimensions and were able to complete 1
>> iteration after 7 hrs 32 mins. The last 3% of reduce was running 1h for
>> each percent. I had 4 working nodes (+1 namenode), Xmx2500, and max num of
>> reducers set to 1.
>>
>> The job args we're using are:
>>
>> bin/mahout dirichlet \
>>   -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
>>   -o /asf-mail-archives/mahout-0.4/dirichlet/ \
>>   -a0 1.0 \
>>   -x 10 \
>>   --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
>>   -k 60
>>
>>
>> We're still studying the code to diagnose this ourselves, but also wanted
>> to get some feedback.
>>
>> Kind regards,
>>
>> Timothy Potter
>> [email protected]
>>
>
>
