Hi Jeff, it is really nice of you to reply. :)
I changed t2 to 0.45 and ran it again, but it is still stuck at 66%. I am using the cosine measure, so the range of values that makes sense to me is 0-1, and 0.45 seems to be the largest value I can go to, but it still does not work. So, what is the problem here? Is it the implementation of the code, or am I setting the parameter values way off? Is there any more info I could provide to help you analyze the issue? Would setting t3 and t4 help?

Thanks

On Tue, Nov 13, 2012 at 10:01 PM, Jeff Eastman <[email protected]> wrote:

> Canopy is very sensitive to the value of T2: too small a value will cause
> the creation of very many canopies in each mapper and these will swamp the
> reducer. I suggest you begin with T1=T2= <a larger value> until you get
> enough canopies. With CosineDistanceMeasure, a value of 1 ought to produce
> only a single canopy and you can go smaller until you get a reasonable
> number. There are also T3 and T4 arguments that allow you to specify the T1
> and T2 values used by the reducer.
>
>
> On 11/13/12 7:01 AM, Phoenix Bai wrote:
>
>> Hi All,
>>
>> 1) data size:
>> environment: company's hadoop clusters.
>> Raw data: 12M
>> tfidf vectors: 25M (ng is set to 2)
>>
>> 2) running command:
>> The tfidf vectors are fed to canopy with the command below:
>>
>> hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
>> org.apache.mahout.clustering.canopy.CanopyDriver \
>> -Dmapred.max.split.size=4000000 \
>> -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
>> -o /mahout/output/tbvideo-canopy-centroids/ \
>> -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>> -t1 0.70 -t2 0.3
>>
>> 3) canopy running status:
>> The MR job runs forever: the map tasks finish very quickly, while
>> the reduce task always hangs at 66%, like below:
>>
>> 12/11/13 16:29:00 INFO mapred.JobClient: map 96% reduce 0%
>> 12/11/13 16:29:07 INFO mapred.JobClient: map 96% reduce 30%
>> 12/11/13 16:29:26 INFO mapred.JobClient: map 100% reduce 30%
>> 12/11/13 16:29:41 INFO mapred.JobClient: map 100% reduce 66%
>> 12/11/13 19:34:39 INFO mapred.JobClient: map 100% reduce 0%
>> 12/11/13 19:34:47 INFO mapred.JobClient: Task Id :
>> attempt_201210311519_1936030_r_000000_0, Status : FAILED
>> java.io.IOException: Task process exit with nonzero status of 137.
>>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:456)
>> 12/11/13 19:35:06 INFO mapred.JobClient: map 100% reduce 66%
>>
>> or sometimes an error like this:
>>
>> 000000_0, Status : FAILED
>> Task attempt_201210311519_1900983_r_000000_0 failed to report status
>> for 600 seconds. Killing!
>>
>> Here is the jstack dump when it gets to 66%:
>>
>> "main" prio=10 tid=0x000000005071a000 nid=0x7ab8 runnable
>> [0x0000000040a3a000]
>>    java.lang.Thread.State: RUNNABLE
>>   at org.apache.mahout.math.OrderedIntDoubleMapping.find(OrderedIntDoubleMapping.java:83)
>>   at org.apache.mahout.math.OrderedIntDoubleMapping.get(OrderedIntDoubleMapping.java:88)
>>   at org.apache.mahout.math.SequentialAccessSparseVector.getQuick(SequentialAccessSparseVector.java:184)
>>   at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:138)
>>   at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
>>   at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
>>   at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:44)
>>   at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:29)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:544)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:167)
>>
>> 4) So, my questions are:
>>
>> What is wrong? Why does it always hang at 66%?
>> I thought canopy was a faster algorithm compared to kmeans,
>> but in this case kmeans runs a whole lot faster than canopy.
>> I have run canopy several times across two days and have never seen it
>> finish; it always throws errors whenever it gets to 66% of the reduce phase.
>>
>> Please enlighten me, or point me toward what the problem could be
>> and how I could fix it.
>> It is only 30M of data, so it can't be the size, right?
>>
>> Thanks all in advance!
>>
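
By the way, based on your suggestion, here is the re-run I plan to try next, starting from T1 = T2 = 1.0 with CosineDistanceMeasure and then lowering them once I see a reasonable number of canopies. I am only guessing that the reducer-side thresholds are passed as -t3/-t4, so please correct me if the flag names are different:

hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
  org.apache.mahout.clustering.canopy.CanopyDriver \
  -Dmapred.max.split.size=4000000 \
  -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
  -o /mahout/output/tbvideo-canopy-centroids/ \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 1.0 -t2 1.0 -t3 1.0 -t4 1.0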

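Also, to pick less arbitrary T values, I am thinking of printing the distances my vectors actually produce with a small standalone check, something like the sketch below. This assumes CosineDistanceMeasure can be called directly through DistanceMeasure.distance(Vector, Vector); the toy vectors are just placeholders for real tf-idf rows:

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class CosineDistanceCheck {
  public static void main(String[] args) {
    // Two toy "tf-idf-like" vectors; in practice these would be read
    // from the tfidf-vectors sequence file.
    Vector a = new DenseVector(new double[] {1.0, 2.0, 0.0, 3.0});
    Vector b = new DenseVector(new double[] {0.5, 1.0, 4.0, 0.0});

    CosineDistanceMeasure measure = new CosineDistanceMeasure();

    // With non-negative weights the cosine distance stays between
    // 0 (same direction) and 1 (orthogonal vectors), so T1/T2 should
    // sit inside the range of distances the data actually produces.
    System.out.println("distance(a, b) = " + measure.distance(a, b));
    System.out.println("distance(a, a) = " + measure.distance(a, a));
  }
}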