I am using Mahout 0.8 embedded in CDH 5.0.0 provided by Cloudera, and I have found that the reduce phase of Mahout streamingkmeans is extremely slow.
For example, with a dataset of 2,000,000 objects and 128 variables, I would like to get 10,000 clusters. The command I executed is the following:

    mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

The job had 15 map tasks, all of which completed in 4 hours. However, the reduce phase ran for over 100 hours and was still stuck at 76%.

I have tuned the performance of Hadoop as follows:

    map task jvm = 3g
    reduce task jvm = 10g
    io.sort.mb = 512
    io.sort.factor = 50
    mapred.reduce.parallel.copies = 10
    mapred.inmem.merge.threshold = 0

I tried to assign enough memory, but the reduce phase is still very slow. Why does reduce take so much time, and what can I do to speed up the job?

I wonder whether it would help to set -rskm to true. However, the -rskm option has a bug in Mahout 0.8, so I cannot try it.

Yours sincerely,
Sylvia Ma
