Re: Dirichlet clustering woes ...

Timothy Potter Thu, 24 Feb 2011 15:01:53 -0800

Thanks Ted, good to know about not having any "tall poles". I'll need to dig
into it a bit more to answer your first question, but at least that gives me
something to look for.
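
For the first question (how many clusters there are and whether one cluster is absorbing most of the points), a minimal sketch of the tally, not Mahout code. It assumes the point-to-cluster assignments have already been exported to plain text, one "clusterId<TAB>pointId" pair per line; that format and the function names cluster_sizes and largest_share are hypothetical, chosen only for illustration.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count points per cluster.

    Assumes each non-empty line looks like 'clusterId<TAB>pointId'
    (a hypothetical export format, not Mahout's native output).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cluster_id, _point_id = line.split("\t", 1)
        counts[cluster_id] += 1
    return counts


def largest_share(counts):
    """Return (cluster_id, fraction of all points) for the biggest cluster."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    cluster_id, size = counts.most_common(1)[0]
    return cluster_id, size / total
```

Passing an open file handle to cluster_sizes works, since it only iterates over lines. A largest_share fraction close to 1.0 would suggest that nearly all points are landing in a single cluster.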



On Thu, Feb 24, 2011 at 3:25 PM, Ted Dunning <[email protected]> wrote:

> Do you have any stats about how many clusters there are and whether a vast
> number of points are being assigned to a single cluster?
>
> I am a little surprised at your results since the Dirichlet clustering
> doesn't have any tall poles (that I know of).  Every point is compared to
> every cluster and contributes to every cluster.  As such, stragglers
> shouldn't be a big deal.
>
> Did you check the usual suspects with respect to swapping and GC?
>
>
> On Thu, Feb 24, 2011 at 2:18 PM, Timothy Potter <[email protected]> wrote:
>
>> My colleague Szymon and I have been working on Mahout-588 and hoped to
>> include Dirichlet in our clustering benchmarks, but unfortunately have not
>> had much success. So we're reaching out to the community to see if anyone
>> else has been successful with somewhat large-scale Dirichlet clustering.
>>
>> Specifically, we have 6,077,604 sparse TFIDF vectors generated from the
>> Apache Mail Archives.
>>
>> Using vectors with 40K dimensions on a 5-node cluster, it runs nicely until
>> map 100% and reduce 92%, and then it virtually stops. It takes 3 min to
>> reach 93%, 7 min to reach 94%, 23 min to reach 95%, 1 h 12 min to reach 96%,
>> and after another 4 h there is no further progress. The CPUs on the nodes
>> run at almost 100% and use the full 6GB of memory.
>>
>> So then we tried vectors with 20K dimensions and were able to complete 1
>> iteration after 7 hrs 32 mins. Each of the last 3 percent of the reduce
>> phase took about 1 hour. I had 4 working nodes (+1 namenode), Xmx2500, and
>> the max number of reducers set to 1.
>>
>> The job args we're using are:
>>
>> bin/mahout dirichlet \
>>    -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
>>    -o /asf-mail-archives/mahout-0.4/dirichlet/ \
>>    -a0 1.0 \
>>    -x 10 \
>>    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
>>    -k 60
>>
>>
>> We're still studying the code to diagnose this ourselves, but also wanted
>> to get some feedback.
>>
>> Kind regards,
>>
>> Timothy Potter
>> [email protected]
>>
>
>
