Canopy is very sensitive to the value of T2: too small a value will create very many canopies in each mapper, and these will swamp the reducer. I suggest you begin with T1 = T2 = <a larger value> until you get enough canopies. With CosineDistanceMeasure, a value of 1 ought to produce only a single canopy, and you can then decrease it until you get a reasonable number. There are also T3 and T4 arguments that let you specify the T1 and T2 values used by the reducer.
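To see why T2 dominates the canopy count, here is a minimal Python sketch of the single-pass assignment loop (modeled on Mahout's CanopyClusterer.addPointToCanopies, but simplified: it only tracks centers, not cluster membership). A point seeds a new canopy only when no existing center lies within T2 of it, so a tight T2 makes nearly every point a new center:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity, as in Mahout's CosineDistanceMeasure
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def canopy_centers(points, t1, t2):
    """Simplified single-pass canopy sketch: a point is 'strongly
    bound' if some existing center is within T2; otherwise it
    seeds a new canopy.  (T1, the loose membership threshold, is
    omitted here since only the number of centers matters.)"""
    centers = []
    for p in points:
        strongly_bound = any(cosine_distance(p, c) < t2 for c in centers)
        if not strongly_bound:
            centers.append(p)
    return centers

rng = np.random.default_rng(0)
pts = rng.random((200, 10))          # random non-negative vectors

few = canopy_centers(pts, 1.0, 1.0)   # T2 = 1: one canopy swallows everything
many = canopy_centers(pts, 0.7, 0.05) # tight T2: most points seed a canopy
print(len(few), len(many))
```

With non-negative vectors every pairwise cosine distance is below 1, so T2 = 1 yields a single canopy, while T2 = 0.05 yields canopies roughly in proportion to the number of points — exactly the explosion that swamps the reducer.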

On 11/13/12 7:01 AM, Phoenix Bai wrote:
Hi All,

1) data size:
environment: company's Hadoop clusters.
raw data: 12 MB
tfidf vectors: 25 MB (ng is set to 2)

2) running command:
The tfidf vectors are fed to Canopy by running the command below:

hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
  org.apache.mahout.clustering.canopy.CanopyDriver \
  -Dmapred.max.split.size=4000000 \
  -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
  -o /mahout/output/tbvideo-canopy-centroids/ \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 0.70 -t2 0.3

3) canopy running status:
The MR job runs practically forever: the map tasks finish very quickly, but the reduce task always hangs at 66%, like below:

12/11/13 16:29:00 INFO mapred.JobClient:  map 96% reduce 0%
12/11/13 16:29:07 INFO mapred.JobClient:  map 96% reduce 30%
12/11/13 16:29:26 INFO mapred.JobClient:  map 100% reduce 30%
12/11/13 16:29:41 INFO mapred.JobClient:  map 100% reduce 66%
12/11/13 19:34:39 INFO mapred.JobClient:  map 100% reduce 0%
12/11/13 19:34:47 INFO mapred.JobClient: Task Id :
attempt_201210311519_1936030_r_000000_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 137.
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:456)
12/11/13 19:35:06 INFO mapred.JobClient:  map 100% reduce 66%

or sometimes an error like this:

000000_0, Status : FAILED
Task attempt_201210311519_1900983_r_000000_0 failed to report status
for 600 seconds. Killing!

Here is the jstack dump taken while it hangs at 66%:

"main" prio=10 tid=0x000000005071a000 nid=0x7ab8 runnable [0x0000000040a3a000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.mahout.math.OrderedIntDoubleMapping.find(OrderedIntDoubleMapping.java:83)
        at org.apache.mahout.math.OrderedIntDoubleMapping.get(OrderedIntDoubleMapping.java:88)
        at org.apache.mahout.math.SequentialAccessSparseVector.getQuick(SequentialAccessSparseVector.java:184)
        at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:138)
        at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
        at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
        at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:44)
        at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:29)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:544)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
        at org.apache.hadoop.mapred.Child.main(Child.java:167)
4) So, my questions are:

What is wrong? Why does it always hang at 66% of the reduce phase?
I thought Canopy was a faster algorithm than k-means, but in this case k-means runs a whole lot faster than Canopy. I have run Canopy several times over two days and have never seen it finish; it always throws errors once the reduce phase reaches 66%.

Please enlighten me, or point me toward what the problem could be and how I could fix it. It is only about 30 MB of data, so it can't be the size, right?

Thanks all in advance!
