Hi Ramon,

My first suggestion would be to use Mahout 0.6, as significant improvements have been made to RowSimilarityJob and the 0.5 version has known bugs.
The runtime of RowSimilarityJob is determined not only by the size of the input but also by the distribution of the interactions among the users. In typical collaborative filtering datasets the interactions roughly follow a power-law distribution, which means there are a few "power users" with an enormous number of interactions. For each of these power users, the square of their interaction count has to be processed: a single user with 10,000 interactions alone contributes on the order of 10^8 item pairs, as many as 10,000 average users with 100 interactions each. These users therefore slow the job down significantly without providing much value (you don't learn a lot from people who like nearly everything). Their interactions need to be down-sampled, which is done via the parameter --maxPrefsPerUserInItemSimilarity in RecommenderJob and --maxPrefsPerUser in ItemSimilarityJob (see the example invocations at the end of this mail).

--sebastian

On 18.10.2011 09:55, WangRamon wrote:
> Hi All,
>
> I'm running a recommender job on a Hadoop environment with about 600,000
> users and 2,000,000 items; the total number of user-preference records is
> about 66,260,000, and the data file is 1GB in size. I found the
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow,
> and I get a lot of logs like these in the mapper task output:
>
> 2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
>
> Actually, I did find a similar question on the mailing list, e.g.
> http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
> where Sebastian said something about using Mahout 0.5 in that thread, and
> yes, I'm using Mahout 0.5. However, there was no further discussion, so it
> would be great if you could share some ideas/suggestions here; that would
> be a big help to me. Thanks in advance.
>
> BTW, I already have the following parameters set in Hadoop:
> mapred.child.java.opts -> 2048M
> fs.inmemory.size.mb -> 200
> io.file.buffer.size -> 131072
>
> I have two servers, each with 32GB RAM. Thanks!
>
> Cheers,
> Ramon
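P.S. For reference, here is a minimal sketch of how those down-sampling flags could be passed on the command line. The input/output paths and the value 500 are illustrative placeholders, not tuned recommendations, and the exact spelling of the --similarityClassname value can differ between versions, so please check bin/mahout recommenditembased --help and bin/mahout itemsimilarity --help against your installation:

  bin/mahout recommenditembased \
    --input /path/to/prefs.csv \
    --output /path/to/recommendations \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --maxPrefsPerUserInItemSimilarity 500

  bin/mahout itemsimilarity \
    --input /path/to/prefs.csv \
    --output /path/to/similarities \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --maxPrefsPerUser 500

Capping each user at 500 preferences bounds the per-user work at 500^2 pairs no matter how active a power user is, which is what makes the job's runtime predictable again.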
