Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

WangRamon Tue, 18 Oct 2011 00:55:59 -0700



Hi all,

I'm running a recommendation job on Hadoop with about 600,000 users and
2,000,000 items. There are about 66,260,000 user-preference records in total,
and the data file is 1 GB. The
RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and
the mapper task output contains many log lines like these:

2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
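
As far as I can read these lines, each pass merges 10 segments into one, so the total drops by 9 per pass (73, 64, 55, 46), with roughly two minutes between passes. Below is a minimal sketch of that arithmetic, assuming the merge factor stays at 10 for every pass (I believe this corresponds to Hadoop's io.sort.factor setting, but that is my assumption). The function name merge_passes is made up for illustration, and Hadoop's real merge logic may differ from this simplified model.

```python
def merge_passes(segments, factor=10):
    """Simplified model: each pass replaces `factor` segments with one.

    Returns the segment totals seen at the start of each intermediate
    pass, plus the number of segments left for the final merge.
    """
    totals = []
    while segments > factor:
        totals.append(segments)
        segments -= factor - 1  # `factor` segments in, one merged segment out
    return totals, segments


# 73 is the first total reported in the mapper log above.
totals, remaining = merge_passes(73)
print(totals)     # totals at the start of each intermediate pass
print(remaining)  # segments left for the final merge
```

Under this model the mapper needs seven intermediate passes before the final merge, which would match the slow progress I am seeing.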

I did find a similar question on the mailing list, e.g.
http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
Sebastian mentioned using Mahout 0.5 in that thread, and I am already using
Mahout 0.5, but there was no further discussion. It would be a big help if you
could share any ideas or suggestions here. Thanks in advance.

BTW, I already have the following parameters set in Hadoop:

mapred.child.java.opts -> 2048M
fs.inmemory.size.mb -> 200
io.file.buffer.size -> 131072

I have two servers, each with 32 GB RAM. Thanks!

Cheers,
Ramon
