I'm surprised too. It looks like you are creating 60 clusters, which is completely reasonable. During map processing, each point is compared to each cluster to generate its pdf(), and the point is assigned to one of the clusters by sampling from a multinomial over all the pdfs. If many points are assigned to a single cluster by this process, then the copy-merge step could take a while to build the reducer input for that cluster. How many mappers are being created from your dataset?
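To make the map step concrete before moving on to the reduce side, here is a minimal standalone sketch of that assignment. The Model interface and pdf() names below are stand-ins for illustration, not the actual Mahout classes:

import java.util.Random;

public class MultinomialAssignSketch {

  // Stand-in for a cluster model (not the real Mahout model classes).
  interface Model {
    double pdf(double[] x); // likelihood of x under this cluster
  }

  // Evaluate pdf(x) against every cluster, then draw the assignment
  // from a multinomial over those values.
  static int assign(double[] x, Model[] clusters, Random rng) {
    double[] p = new double[clusters.length];
    double total = 0.0;
    for (int i = 0; i < clusters.length; i++) {
      p[i] = clusters[i].pdf(x); // one pdf evaluation per cluster, per point
      total += p[i];
    }
    double r = rng.nextDouble() * total; // multinomial draw
    for (int i = 0; i < p.length; i++) {
      r -= p[i];
      if (r <= 0.0) {
        return i;
      }
    }
    return p.length - 1; // guard against floating-point rounding
  }
}

So the map cost is the same O(points * k) pdf evaluations regardless of skew; the skew only bites downstream, once one cluster soaks up most of the draws.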
The reducers then accumulate the posterior statistics for one or more clusters. You can try increasing the number of reducers (up to k), which can help with this step. Again, if most of your points are being assigned to a single cluster, that reducer will be bogged down observing them all.

Also, since the models accumulate Gaussian statistics to compute mean and std posterior values, these values will tend to become denser as many vectors are summed, and this can drive up memory consumption during the reduce step (the first sketch after the quoted thread below illustrates this). You might try increasing the value of -k to spread the vectors over more clusters. Adjusting the value of -a0 could also cause input vectors to be more evenly distributed over the initial prior clusters, which have random center vectors (see the second sketch below). For text, you might find that the L1Model with a CosineDistanceMeasure could work better than the default NormalModelDistribution.

You are breaking new ground here. I've run Dirichlet over Reuters and it seemed to work OK at that scale.

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Thursday, February 24, 2011 2:26 PM
To: [email protected]
Cc: Timothy Potter
Subject: Re: Dirichlet clustering woes ...

Do you have any stats about how many clusters there are and whether a vast number of points are being assigned to a single cluster?

I am a little surprised at your results since the Dirichlet clustering doesn't have any tall poles (that I know of). Every point is compared to every cluster and contributes to every cluster. As such, stragglers shouldn't be a big deal.

Did you check the usual suspects with respect to swapping and GC?

On Thu, Feb 24, 2011 at 2:18 PM, Timothy Potter <[email protected]> wrote:

> My colleague Szymon and I have been working on Mahout-588 and hoped to
> include Dirichlet in our clustering benchmarks, but unfortunately have not
> had much success. So we're reaching out to the community to see if anyone
> else has been successful with somewhat large-scale Dirichlet clustering.
>
> Specifically, we have 6,077,604 sparse TFIDF vectors generated from the
> Apache Mail Archives.
>
> Using vectors with 40K dimensions on a 5-node cluster, it runs nicely
> until map 100% and reduce 92%, and then it virtually stops: it takes
> 3 min to reach 93%, 7 min to reach 94%, 23 min to reach 95%, 1:12 to
> reach 96%, and after another 4 h, nothing. The CPUs at the nodes run at
> almost 100% and use the full 6 GB.
>
> So then we tried vectors with 20K dimensions and were able to complete
> 1 iteration after 7 hrs 32 mins. The last 3% of reduce took 1 h per
> percent; I had 4 working nodes (+1 namenode), Xmx2500, and the max
> number of reducers set to 1.
>
> The job args we're using are:
>
> bin/mahout dirichlet \
>   -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
>   -o /asf-mail-archives/mahout-0.4/dirichlet/ \
>   -a0 1.0 \
>   -x 10 \
>   --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -k 60
>
> We're still studying the code to diagnose ourselves, but also wanted to
> get some feedback.
>
> Kind regards,
>
> Timothy Potter
> [email protected]
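First sketch: the reduce-side memory growth. The posterior statistics are just a count plus two running sums, so any dimension that is non-zero in any observed vector becomes non-zero in the accumulators; summing millions of sparse TFIDF vectors into 20K-40K dimension arrays densifies them quickly. This is a minimal sketch using plain double[] instead of Mahout's Vector types, and the names are illustrative, not the actual Mahout classes:

public class GaussianStatsSketch {
  int s0;        // number of points observed
  double[] s1;   // running sum of the vectors
  double[] s2;   // running sum of the squared components

  GaussianStatsSketch(int dims) {
    s1 = new double[dims];
    s2 = new double[dims];
  }

  // Each observation unions its non-zero dimensions into s1/s2,
  // so the accumulators densify as more points are summed in.
  void observe(double[] x) {
    s0++;
    for (int i = 0; i < x.length; i++) {
      s1[i] += x[i];
      s2[i] += x[i] * x[i];
    }
  }

  // Posterior mean and std per dimension, recovered from s0/s1/s2.
  double mean(int i) {
    return s1[i] / s0;
  }

  double std(int i) {
    double m = mean(i);
    return Math.sqrt(s2[i] / s0 - m * m);
  }
}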

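Second sketch: the -a0 knob. Assuming, as a simplification of what the driver actually does, that the initial mixture weights are a draw from a symmetric Dirichlet with parameter a0, such a draw can be generated by normalizing independent Gamma(a0) samples. Larger a0 pushes the sampled weights toward uniform 1/k, so input vectors get spread more evenly over the prior clusters on the first pass. Again a generic illustration, not Mahout code:

import java.util.Random;

public class DirichletWeightsSketch {

  // Sample mixture weights from a symmetric Dirichlet(a0) by
  // normalizing k independent Gamma(a0, 1) draws. Larger a0 means
  // weights closer to uniform 1/k; small a0 means a few big weights.
  static double[] sampleWeights(int k, double a0, Random rng) {
    double[] w = new double[k];
    double sum = 0.0;
    for (int i = 0; i < k; i++) {
      w[i] = sampleGamma(a0, rng);
      sum += w[i];
    }
    for (int i = 0; i < k; i++) {
      w[i] /= sum;
    }
    return w;
  }

  // Marsaglia-Tsang Gamma(shape, 1) sampler, with the usual
  // boost for shape < 1.
  static double sampleGamma(double shape, Random rng) {
    if (shape < 1.0) {
      return sampleGamma(shape + 1.0, rng)
          * Math.pow(rng.nextDouble(), 1.0 / shape);
    }
    double d = shape - 1.0 / 3.0;
    double c = 1.0 / Math.sqrt(9.0 * d);
    while (true) {
      double x = rng.nextGaussian();
      double v = 1.0 + c * x;
      if (v <= 0.0) {
        continue;
      }
      v = v * v * v;
      double u = rng.nextDouble();
      if (Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
        return d * v;
      }
    }
  }

  public static void main(String[] args) {
    Random rng = new Random(42);
    // With a0 = 1.0 and k = 60 the initial weights come out noticeably
    // uneven; raise a0 and the draws flatten toward 1/60.
    for (double w : sampleWeights(60, 1.0, rng)) {
      System.out.printf("%.4f ", w);
    }
  }
}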