Hi all,

I'm running a recommendation job on a Hadoop environment with about 600,000
users and 2,000,000 items. There are about 66,260,000 user-preference records
in total, and the data file is 1GB in size. I found that the
RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and I
get a lot of log lines like these in the mapper task output:

2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
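
The totals in these log lines drop by 9 each time (73, 64, 55, 46), which is what one would expect if every pass merges 10 segments into 1. Below is a rough Python sketch of that arithmetic, not Hadoop's actual Merger code. It assumes a merge factor of 10 (taken from the "Merging 10" in the log; in Hadoop this is presumably the io.sort.factor setting) and a simplified model in which every intermediate pass merges exactly that many segments. The function name is made up for illustration.

```python
def intermediate_merge_totals(segments, factor):
    """Return the segment totals seen at the start of each intermediate
    merge pass, under a simplified model: every pass merges `factor`
    segments into one, until at most `factor` segments remain for the
    final merge."""
    if factor < 2:
        raise ValueError("merge factor must be at least 2")
    totals = []
    while segments > factor:
        totals.append(segments)
        # `factor` segments go in, 1 comes out: net reduction of factor - 1.
        segments -= factor - 1
    return totals


if __name__ == "__main__":
    # 73 is the first total shown in the log; 10 is the assumed merge factor.
    totals = intermediate_merge_totals(73, 10)
    print("totals per intermediate pass:", totals)
    print("intermediate passes:", len(totals))
```

Under these assumptions the first four values match the log above, and the same mapper would still have a few more intermediate passes to go before the final merge. The timestamps show each pass taking roughly one and a half to two and a half minutes.
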
I did find a similar question on the mailing list, e.g.
http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
Sebastian mentioned using Mahout 0.5 in that thread, and yes, I'm using
Mahout 0.5, but there was no further discussion. It would be great if you
could share some ideas or suggestions here; that would be a big help to me.
Thanks in advance.

BTW, I already have the following parameters set in Hadoop:

mapred.child.java.opts -> 2048M
fs.inmemory.size.mb -> 200
io.file.buffer.size -> 131072

I have two servers, each with 32GB RAM. Thanks!

Cheers,
Ramon