RE: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

WangRamon Tue, 18 Oct 2011 01:53:16 -0700



Hi Sebastian
 
Thanks for your quick reply.
 
As far as I know, the latest Mahout release is 0.5 and Mahout 0.6 is still 
under development (please correct me if I'm wrong), so I'm not sure whether I 
can use Mahout 0.6 in a production environment. We plan to run the Mahout 
recommender job on a cluster of 30+ nodes.
 
I'm running a benchmark right now with test data in which each user has 
preferences for about 60~120 items, so I think the data file should be fine 
for now. I cannot find the two parameters mentioned in your mail, 
"maxPrefsPerUserInItemSimilarity" and "maxPrefsPerUser". Are these two only 
available in Mahout 0.6?
 
You also mentioned ItemSimilarityJob. That job is not used by the class 
"org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"; RecommenderJob uses 
RowSimilarityJob instead. So what is the difference between ItemSimilarityJob 
and RowSimilarityJob, and how do I use ItemSimilarityJob?
 
Thanks,
Ramon
 
> Date: Tue, 18 Oct 2011 10:10:43 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Any general performance tips for job 
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
> 
> Hi Ramon,
> 
> my first suggestion would be to use Mahout 0.6 as significant
> improvements have been made to RowSimilarityJob and the 0.5 version has
> known bugs.
> 
> The runtime of RowSimilarityJob is not only determined by the size of
> the input but also by the distribution of the interactions among the
> users. In typical collaborative filtering datasets the interactions will
> roughly follow a power-law distribution which means that there are a few
> "power"-users with an enormous amount of interactions.
> 
> For each of these "power"-users the square of the number of their
> interactions has to be processed which means they significantly slow
> down the job without providing too much value (you don't learn a lot
> from people that like "nearly everything"). The interactions of these
> power-users need to be down-sampled which is done via the parameter
> --maxPrefsPerUserInItemSimilarity in RecommenderJob and
> --maxPrefsPerUser in ItemSimilarityJob.
> 
> --sebastian
> 
> 
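To make the point about "power"-users concrete, here is a minimal sketch of
the arithmetic. It is not Mahout's code: the function names, the user counts
and the cap of 1000 are all made up for illustration, and Mahout's own
sampling may work differently. It only shows that a user with n interactions
contributes on the order of n^2 item pairs, and how capping the interactions
per user shrinks the total.

```python
import random


def pair_count(n):
    """Number of unordered item pairs among n interactions of one user."""
    return n * (n - 1) // 2


def total_pairs(prefs_per_user, cap=None):
    """Sum the per-user item pairs, optionally capping each user's interactions."""
    total = 0
    for n in prefs_per_user:
        if cap is not None:
            n = min(n, cap)
        total += pair_count(n)
    return total


def down_sample(items, cap, rng=random):
    """Keep at most `cap` randomly chosen interactions of one user."""
    items = list(items)
    if len(items) <= cap:
        return items
    return rng.sample(items, cap)


# Hypothetical data set: 1000 ordinary users with 100 interactions each
# and 5 "power"-users with 20000 interactions each.
prefs = [100] * 1000 + [20000] * 5

print("pairs without a cap:  ", total_pairs(prefs))
print("pairs with cap = 1000:", total_pairs(prefs, cap=1000))
print("sampled interactions: ", len(down_sample(range(20000), 1000)))
```

In this made-up data set the five power-users account for almost all of the
pairs, which is why down-sampling them removes most of the work while
touching very few users.
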
> On 18.10.2011 09:55, WangRamon wrote:
> > 
> > 
> > 
> > 
> > Hi All,
> > 
> > I'm running a recommender job on a Hadoop environment with about 600000 
> > users and 2000000 items. There are about 66260000 user-preference records 
> > in total, and the data file is 1GB in size. I found that the 
> > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, 
> > and I get a lot of log lines like these in the mapper task output:
> > 
> > 2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 
> > intermediate segments out of a total of 73
> > 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 
> > intermediate segments out of a total of 64
> > 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 
> > intermediate segments out of a total of 55
> > 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 
> > intermediate segments out of a total of 46
> > 
> > I did find a similar question on the mailing list, e.g. 
> > http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
> > . Sebastian suggested using Mahout 0.5 in that thread, and yes, I'm 
> > already using Mahout 0.5, but there was no further discussion. It would be 
> > great if you could share some ideas or suggestions here; that would be a 
> > big help to me. Thanks in advance.
> > 
> > BTW, I have the following parameters already set in Hadoop:
> > 
> > mapred.child.java.opts -> 2048M
> > fs.inmemory.size.mb -> 200
> > io.file.buffer.size -> 131072
> > 
> > I have two servers, each with 32GB RAM. Thanks!
> > 
> > Cheers
> > Ramon
>
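The Merger lines in the original message can be read with a small model. The
"10" in "Merging 10 intermediate segments" matches Hadoop's default
io.sort.factor, which is assumed to be unchanged here since it is not among
the parameters listed. The sketch below is a simplified model, not Hadoop's
Merger code: it assumes every intermediate pass merges exactly `factor`
segments into one, which fits the totals 73, 64, 55, 46 seen in the log. The
function name is made up.

```python
def merge_passes(segments, factor=10):
    """Return the segment totals seen before each intermediate merge pass.

    Simplified model: each pass merges `factor` segments into one, so the
    total drops by factor - 1 until at most `factor` segments remain for
    the final merge.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    totals = []
    while segments > factor:
        totals.append(segments)
        segments -= factor - 1
    return totals


# Totals before each intermediate pass, starting from the 73 segments in the log.
print(merge_passes(73))

# With a merge factor at least as large as the segment count there are no
# intermediate passes in this model.
print(merge_passes(73, factor=100))
```

Under this model, a larger merge factor (or fewer spilled segments) means
fewer intermediate passes over the map output; whether that helps in practice
depends on the memory and disks available.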