Hi Ramon,

My first suggestion would be to use Mahout 0.6, as significant
improvements have been made to RowSimilarityJob and the 0.5 version has
known bugs.

The runtime of RowSimilarityJob is determined not only by the size of
the input but also by the distribution of the interactions among the
users. In typical collaborative filtering datasets the interactions
roughly follow a power-law distribution, which means there are a few
"power users" with an enormous number of interactions.

For each of these power users, the square of the number of their
interactions has to be processed, which means they significantly slow
down the job without providing much value (you don't learn a lot from
people who like "nearly everything"). The interactions of these power
users therefore need to be down-sampled, which is done via the parameter
--maxPrefsPerUserInItemSimilarity in RecommenderJob and
--maxPrefsPerUser in ItemSimilarityJob.
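
To illustrate why this helps, here is a small Python sketch (not Mahout
code; the function names are made up for illustration). A user with n
interactions contributes n-choose-2 item co-occurrence pairs, so the
cost grows quadratically with n, and capping each user's interactions
at a maximum (as the parameters above do) cuts that cost dramatically:

```python
import random

def pair_count(n):
    # Number of item co-occurrence pairs a user with n interactions
    # contributes: n choose 2, i.e. quadratic in n.
    return n * (n - 1) // 2

def downsample(interactions, max_prefs):
    # Cap a user's interaction list at max_prefs by random sampling;
    # a rough stand-in for Mahout's down-sampling of power users.
    if len(interactions) <= max_prefs:
        return list(interactions)
    return random.sample(interactions, max_prefs)

# A power user with 10,000 interactions contributes ~50 million pairs:
power_user = list(range(10_000))
print(pair_count(len(power_user)))           # 49995000

# Capping at 1,000 interactions reduces that by a factor of ~100:
capped = downsample(power_user, 1_000)
print(pair_count(len(capped)))               # 499500
```

Users below the cap are untouched, so the information lost comes only
from the heaviest (and least informative) users.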

--sebastian


On 18.10.2011 09:55, WangRamon wrote:
> Hi All,
> 
> I'm running a recommend job on a Hadoop environment with about 600000
> users and 2000000 items; the total number of user-pref records is about
> 66260000, and the data file is about 1GB in size. I found the
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very
> slow, and I get a lot of logs like these in the mapper task output:
> 
> 2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
> 
> Actually, I did find some similar questions on the mailing list, e.g.
> http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
> where Sebastian said something about using Mahout 0.5, and yes, I'm
> using Mahout 0.5. However, there was no further discussion in that
> thread, so it would be great if you could share some ideas/suggestions
> here. That would be a big help to me; thanks in advance.
> 
> BTW, I already have the following parameters set in Hadoop:
> 
> mapred.child.java.opts -> 2048M
> fs.inmemory.size.mb -> 200
> io.file.buffer.size -> 131072
> 
> I have two servers, each with 32GB RAM. THANKS!
> 
> Cheers,
> Ramon
