Hi Jeff, it is really nice of you to reply. :)
I changed t2 to 0.45 and ran it again, but it is still stuck at 66%. I am using the cosine measure, so the range of values that makes sense to me is 0-1, and 0.45 seems to be the largest value I can go to, but it still does not work. So, what is the problem here? Is it the implementation of the code, or am I setting the parameter values way off? Is there any more info I could provide to help you analyze the issue? Would setting t3 and t4 help?

Thanks

On Tue, Nov 13, 2012 at 10:01 PM, Jeff Eastman <[email protected]> wrote:

> Canopy is very sensitive to the value of T2: too small a value will cause
> the creation of very many canopies in each mapper and these will swamp the
> reducer. I suggest you begin with T1=T2= <a larger value> until you get
> enough canopies. With CosineDistanceMeasure, a value of 1 ought to produce
> only a single canopy and you can go smaller until you get a reasonable
> number. There are also T3 and T4 arguments that allow you to specify the T1
> and T2 values used by the reducer.
>
>
> On 11/13/12 7:01 AM, Phoenix Bai wrote:
>
>> Hi All,
>>
>> 1) data size:
>> environment: company's hadoop clusters.
>> Raw data: 12M
>> tfidf vectors: 25M (ng is set to 2)
>>
>> 2) running command:
>> The tfidf vectors are fed to canopy with the command below:
>>
>> hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
>> org.apache.mahout.clustering.canopy.CanopyDriver \
>> -Dmapred.max.split.size=4000000 \
>> -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
>> -o /mahout/output/tbvideo-canopy-centroids/ \
>> -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>> -t1 0.70 -t2 0.3
>>
>> 3) canopy running status:
>> The MR job runs forever: the map tasks finish very quickly, while
>> the reduce task always hangs at 66%, like below:
>>
>> 12/11/13 16:29:00 INFO mapred.JobClient: map 96% reduce 0%
>> 12/11/13 16:29:07 INFO mapred.JobClient: map 96% reduce 30%
>> 12/11/13 16:29:26 INFO mapred.JobClient: map 100% reduce 30%
>> 12/11/13 16:29:41 INFO mapred.JobClient: map 100% reduce 66%
>> 12/11/13 19:34:39 INFO mapred.JobClient: map 100% reduce 0%
>> 12/11/13 19:34:47 INFO mapred.JobClient: Task Id :
>> attempt_201210311519_1936030_r_000000_0, Status : FAILED
>> java.io.IOException: Task process exit with nonzero status of 137.
>>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:456)
>> 12/11/13 19:35:06 INFO mapred.JobClient: map 100% reduce 66%
>>
>> or sometimes an error like this:
>>
>> 000000_0, Status : FAILED
>> Task attempt_201210311519_1900983_r_000000_0 failed to report status
>> for 600 seconds. Killing!
>>
>> Here is the jstack dump when it gets to 66%:
>>
>> "main" prio=10 tid=0x000000005071a000 nid=0x7ab8 runnable
>> [0x0000000040a3a000]
>>    java.lang.Thread.State: RUNNABLE
>>   at org.apache.mahout.math.OrderedIntDoubleMapping.find(OrderedIntDoubleMapping.java:83)
>>   at org.apache.mahout.math.OrderedIntDoubleMapping.get(OrderedIntDoubleMapping.java:88)
>>   at org.apache.mahout.math.SequentialAccessSparseVector.getQuick(SequentialAccessSparseVector.java:184)
>>   at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:138)
>>   at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
>>   at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
>>   at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:44)
>>   at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:29)
>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:544)
>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:167)
>>
>> 4) So, my questions are:
>>
>> What is wrong? Why does it always hang at 66%?
>> I thought canopy was a faster algorithm compared to kmeans,
>> but in this case kmeans runs a whole lot faster than canopy.
>> I have run canopy several times across two days and have never seen it
>> finish; it always throws errors whenever it gets to 66% of the reduce phase.
>>
>> Please enlighten me, or point me toward what the problem could be
>> and how I could fix it.
>> It is only 30M of data, so it can't be the size, right?
>>
>> Thanks all in advance!
>>
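
By the way, based on your suggestion, here is the re-run I plan to try next, starting from T1 = T2 = 1.0 with CosineDistanceMeasure and then lowering them once I see a reasonable number of canopies. I am only guessing that the reducer-side thresholds are passed as -t3/-t4, so please correct me if the flag names are different:

hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
  org.apache.mahout.clustering.canopy.CanopyDriver \
  -Dmapred.max.split.size=4000000 \
  -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
  -o /mahout/output/tbvideo-canopy-centroids/ \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 1.0 -t2 1.0 -t3 1.0 -t4 1.0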

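Also, to pick less arbitrary T values, I am thinking of printing the distances my vectors actually produce with a small standalone check, something like the sketch below. This assumes CosineDistanceMeasure can be called directly through DistanceMeasure.distance(Vector, Vector); the toy vectors are just placeholders for real tf-idf rows:

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class CosineDistanceCheck {
  public static void main(String[] args) {
    // Two toy "tf-idf-like" vectors; in practice these would be read
    // from the tfidf-vectors sequence file.
    Vector a = new DenseVector(new double[] {1.0, 2.0, 0.0, 3.0});
    Vector b = new DenseVector(new double[] {0.5, 1.0, 4.0, 0.0});

    CosineDistanceMeasure measure = new CosineDistanceMeasure();

    // With non-negative weights the cosine distance stays between
    // 0 (same direction) and 1 (orthogonal vectors), so T1/T2 should
    // sit inside the range of distances the data actually produces.
    System.out.println("distance(a, b) = " + measure.distance(a, b));
    System.out.println("distance(a, a) = " + measure.distance(a, a));
  }
}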