Thanks for the quick response. I called the job like this:
hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.child.java.opts=-Xmx2048m \
  --input path \
  --output outpath \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --maxSimilaritiesPerItem 500 \
  --booleanData true \
  --numRecommendations 500 \
  --threshold 0.01 \
  --usersFile users_to_recomm

The resulting file contains 113K users (sorry, not 130K) rather than the 110K users in the usersFile. My dataset is about 720MB in size; it is split into 64MB blocks across my 4 datanodes. Can params like -Dmapred.reduce.tasks=10 make it faster?

Thanks!

2013/3/25 Sebastian Schelter <[email protected]>

> Hi JU,
>
> are you sure regarding 1.? It would be a bug. How exactly do you call
> the job?
>
> 2. The threshold is used during the similarity computation and is a
> lower bound for the similarities considered. For certain measures (like
> Pearson or Cosine) it also allows pruning some item pairs early. You
> have to choose it experimentally according to your use case.
>
> 3. The job has a higher computational complexity than ALS and its
> runtime depends on the distribution of the interactions, e.g. users with
> a high number of interactions cause the job to take very long. There is
> a parameter that controls this, maxPrefsPerUserInItemSimilarity; by
> default it is 1000 (which means 1000 interactions are considered). You
> can set this to something like 500 if you want.
>
> Regarding the fact that only one reducer runs: how large is your input,
> does it span several blocks in HDFS?
>
> 48M datapoints is not that much; you could try to do the recommendation
> on a single machine if you have sufficient memory. The class
> o.a.m.cf.taste.similarity.precompute.example.BatchItemSimilaritiesGroupLens
> shows how to precompute similarities efficiently on a single machine.
> After that, you can instantiate a recommender with the similarities to
> get your 110000 recommendations.
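[Editor's note: for reference, the reducer count can be passed as a generic Hadoop option in the same position as the heap setting, and the maxPrefsPerUserInItemSimilarity cap mentioned above can be lowered in the same invocation. A sketch, assuming Mahout 0.8 / Hadoop 1.x flag names; paths and the reducer count 10 are placeholders, not verified values:]

```shell
# Sketch: RecommenderJob with an explicit reducer count and a lower
# per-user interaction cap. -D generic options must precede named args.
hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.child.java.opts=-Xmx2048m \
  -Dmapred.reduce.tasks=10 \
  --input path \
  --output outpath \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --maxSimilaritiesPerItem 500 \
  --maxPrefsPerUserInItemSimilarity 500 \
  --booleanData true \
  --numRecommendations 500 \
  --threshold 0.01 \
  --usersFile users_to_recomm
```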
> On 25.03.2013 10:31, Han JU wrote:
> > Hi,
> >
> > After ParallelAlsJob, I'm now trying the parallel item-based recommender
> > job. Here are some questions.
> >
> > 1. I specified a usersFile, which contains 110000 distinct users, but
> > the output contains more than this, recommendations for nearly 130000
> > users. Why is this?
> > 2. How is the threshold value chosen in real cases? For example, I'm
> > using boolean data and LogLikelihood.
> > 3. The job runs slowly, nearly 8h on 48M datapoints. By default all
> > jobs have only one reducer, which is the slowest part. How should I
> > choose and set the reducer number to make it faster? For example the
> > last job, PartialMultiplyMapper-Reducer, takes 7h and its reducer takes
> > 5h. On the same data ParallelAls finishes in 1.5h with the threaded
> > version.
> >
> > Thanks!

--
JU Han
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel
+33 0619608888
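[Editor's note: following Sebastian's pointer to BatchItemSimilaritiesGroupLens, a rough single-machine sketch of the precomputation. This is one reading of the Mahout 0.8 precompute API, not a verified program: the class names, constructor arguments, and file paths below should be checked against the BatchItemSimilaritiesGroupLens example before use.]

```java
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;

public class PrecomputeSimilarities {
  public static void main(String[] args) throws Exception {
    // Input/output paths are placeholders for illustration only.
    DataModel model = new FileDataModel(new File("interactions.csv"));
    ItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

    // Precompute up to 500 similar items per item, multithreaded,
    // writing the similarities to a local file.
    BatchItemSimilarities batch =
        new MultithreadedBatchItemSimilarities(recommender, 500);
    int numSimilarities = batch.computeItemSimilarities(
        Runtime.getRuntime().availableProcessors(), // degree of parallelism
        1,                                          // max duration in hours
        new FileSimilarItemsWriter(new File("similarities.csv")));
    System.out.println(numSimilarities + " similarities precomputed");
  }
}
```

After this step, the precomputed similarities can back an item-based recommender to produce the per-user recommendations, as Sebastian describes.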
