Do you have any stats about how many clusters there are and whether a vast
number of points are being assigned to a single cluster?
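
In case it helps, here is a minimal sketch of the kind of summary I mean. It is plain Python, not part of Mahout, and it assumes you have already pulled one cluster id per point out of the clustering output; the function name and the example list are hypothetical.

```python
from collections import Counter


def cluster_size_summary(assignments):
    """Return (non-empty cluster count, largest cluster id, its share of points).

    `assignments` is assumed to hold one cluster id per point.
    """
    counts = Counter(assignments)
    if not counts:
        raise ValueError("no assignments given")
    total = sum(counts.values())
    biggest_id, biggest_count = counts.most_common(1)[0]
    return len(counts), biggest_id, biggest_count / total


# Hypothetical assignments for ten points, heavily skewed toward cluster 0.
example = [0, 0, 0, 0, 0, 0, 0, 1, 2, 2]
n_clusters, biggest, share = cluster_size_summary(example)
print(n_clusters, biggest, share)
```

A share close to 1.0 for one cluster would tell us that nearly all points are landing in the same place.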

I am a little surprised at your results since the Dirichlet clustering
doesn't have any tall poles (that I know of).  Every point is compared to
every cluster and contributes to every cluster.  As such, stragglers
shouldn't be a big deal.
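
To make that concrete, here is a rough sketch of what I mean by every point contributing to every cluster. This is not Mahout's implementation: the scoring function is a made-up Gaussian-like stand-in, and all names are hypothetical. The point is only that the work per point is k score evaluations regardless of which cluster the point ends up favoring.

```python
import math


def gaussian_like(point, center):
    """Hypothetical stand-in score: larger when the point is nearer the center."""
    d2 = sum((a - b) ** 2 for a, b in zip(point, center))
    return math.exp(-d2)


def soft_assign(points, centers, score=gaussian_like):
    """Give every point a normalized weight for every cluster.

    Returns the total weight each cluster received and the number of
    score evaluations performed (always len(points) * len(centers)).
    """
    k = len(centers)
    totals = [0.0] * k
    evaluations = 0
    for p in points:
        scores = [score(p, c) for c in centers]
        evaluations += k
        norm = sum(scores)
        if norm == 0.0:
            # All scores underflowed; spread the point evenly.
            scores, norm = [1.0] * k, float(k)
        for j, s in enumerate(scores):
            totals[j] += s / norm
    return totals, evaluations


points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
centers = [(0.0, 0.0), (1.0, 1.0)]
totals, evaluations = soft_assign(points, centers)
print(totals, evaluations)
```

With the numbers from your message (6,077,604 points and -k 60), the same pattern comes to 364,656,240 point-to-cluster evaluations per iteration, spread evenly over the points rather than piled onto one cluster.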

Did you check the usual suspects with respect to swapping and GC?

On Thu, Feb 24, 2011 at 2:18 PM, Timothy Potter <[email protected]> wrote:

> My colleague Szymon and I have been working on Mahout-588 and hoped to
> include Dirichlet in our clustering benchmarks, but unfortunately have not
> had much success. So we're reaching out to the community to see if anyone
> else has been successful with somewhat large-scale Dirichlet clustering.
>
> Specifically, we have 6,077,604 sparse TFIDF vectors generated from the
> Apache Mail Archives.
>
> Using vectors with 40K dimensions on a 5-node cluster, the job runs nicely
> until map 100% and reduce 92%, and then it virtually stops. It takes 3min to
> reach 93%, 7min to reach 94%, 23min to reach 95%, 1:12 to reach 96%, and
> after another 4h there is no further progress.
> The CPUs on the nodes run at almost 100%, with the full 6GB in use.
>
> So then we tried vectors with 20K dimensions and were able to complete 1
> iteration after 7 hrs 32 mins. Each of the last 3 percent of the reduce
> took about 1h. I had 4 working nodes (+1 namenode), Xmx2500 and the max
> number of reducers set to 1.
>
> The job args we're using are:
>
> bin/mahout dirichlet \
>    -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
>    -o /asf-mail-archives/mahout-0.4/dirichlet/ \
>    -a0 1.0 \
>    -x 10 \
>    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
>    -k 60
>
>
> We're still studying the code to diagnose ourselves, but also wanted to get
> some feedback.
>
> Kind regards,
>
> Timothy Potter
> [email protected]
>
