Hi JU, I reworked the RecommenderJob in a similar way to the ALS job. Can you give it a try?
You have to try the patch from
https://issues.apache.org/jira/browse/MAHOUT-1169 It introduces a new
param to RecommenderJob called --numThreads. The configuration of the
job should be done similarly to the ALS job.

/s

On 20.03.2013 12:38, Han JU wrote:
> Thanks again Sebastian and Sean, I set -Xmx4000m for mapred.child.java.opts
> and 8 threads for each mapper. Now the job runs smoothly and the whole
> factorization ends in 45min. With your settings I think it should be even
> faster.
>
> One more thing is that the RecommenderJob is kind of slow (for all users).
> For example, I want a list of the top 500 items to recommend. Any
> pointers on how to modify the job code so that it can consult a file and
> then calculate recommendations only for the user IDs in that file?
>
> 2013/3/20 Han JU <[email protected]>
>
>> Hi Sebastian,
>>
>> I've tried the svn trunk. Hadoop constantly complains about memory with
>> "out of memory" errors.
>> The datanode has 4 physical cores and, by hyper-threading, 16 logical
>> cores, so I set --numThreadsPerSolver to 16, and that seems to cause a
>> problem with memory.
>> How do you set your mapred.child.java.opts? Given that we allow only one
>> mapper, should that be nearly the whole size of system memory?
>>
>> Thanks!
>>
>> 2013/3/19 Sebastian Schelter <[email protected]>
>>
>>> Hi JU,
>>>
>>> We recently rewrote the factorization code; it should be much faster
>>> now. You should use the current trunk, make Hadoop schedule only one
>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>>> number of cores that you want to use per machine (use all if you can).
>>>
>>> I got astonishing results running the code like this on a 26-machine
>>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>>> dataset (700M datapoints).
>>>
>>> Let me know if you need more information.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 19.03.2013 15:31, Han JU wrote:
>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>> With a simple try on a small part of the data, it seems a larger
>>>> alpha (~40) gets me a better result.
>>>> Do you have an idea how long ParallelALS will take for the 700MB
>>>> complete dataset? It contains ~48 million triples. The Hadoop cluster
>>>> at my disposal has 5 nodes and can factorize the MovieLens 10M in
>>>> about 13min.
>>>>
>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>
>>>>> You should also be aware that the alpha parameter comes from a formula
>>>>> the authors introduce to measure the "confidence" in the observed
>>>>> values:
>>>>>
>>>>> confidence = 1 + alpha * observed_value
>>>>>
>>>>> You can also change that formula in the code to something that you
>>>>> see as a better fit; the paper even suggests alternative variants.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>> Thanks for the quick responses.
>>>>>>
>>>>>> Yes, it's that dataset. What I'm using is triplets of "user_id
>>>>>> song_id play_times", of ~1m users. No audio features, just
>>>>>> plain-text triples.
>>>>>>
>>>>>> It seems to me that the "implicit feedback" paper matches this
>>>>>> dataset well: no explicit ratings, but the number of times a song
>>>>>> was listened to.
>>>>>>
>>>>>> Thank you Sean for the alpha value; I think they use big numbers
>>>>>> because the values in their R matrix are big.
>>>>>>
>>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>>
>>>>>>> JU,
>>>>>>>
>>>>>>> are you referring to this dataset?
>>>>>>>
>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>
>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>> One word of caution: there are at least two papers on ALS, and
>>>>>>>> they define lambda differently. I think you are talking about
>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>
>>>>>>>> I've been working with some folks who point out that alpha=40
>>>>>>>> seems to be too high for most data sets. After running some tests
>>>>>>>> on common data sets, alpha=1 looks much better. YMMV.
>>>>>>>>
>>>>>>>> In the end you have to evaluate these two parameters, and the
>>>>>>>> number of features, across a range to determine what's best.
>>>>>>>>
>>>>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>
>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm wondering, has someone tried the ParallelALS with implicit
>>>>>>>>> feedback job on the Million Song Dataset? Any pointers on alpha
>>>>>>>>> and lambda?
>>>>>>>>>
>>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know
>>>>>>>>> what the r values in their matrix are. They said it's based on
>>>>>>>>> the time units that users have watched a show, so it may be big.
>>>>>>>>>
>>>>>>>>> Many thanks!
>>>>>>>>> --
>>>>>>>>> *JU Han*
>>>>>>>>>
>>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>>>>
>>>>>>>>> +33 0619608888
>>
>> --
>> *JU Han*
>>
>> Software Engineer Intern @ KXEN Inc.
>> UTC - Université de Technologie de Compiègne
>> *GI06 - Fouille de Données et Décisionnel*
>>
>> +33 0619608888
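[Editor's note] Pulling the scattered settings from this thread
together, a factorization run along the lines Sebastian describes might
look like the fragment below. The -D properties, -Xmx4000m, alpha=40,
and --numThreadsPerSolver 16 come from the thread itself; the jar name,
job class, paths, and the remaining option values are placeholders that
should be checked against your Mahout version's --help output.

```shell
# One mapper per machine, reused JVMs, generous heap, all cores for the
# solver. Paths and numeric values are placeholders -- adjust as needed.
hadoop jar mahout-core-job.jar \
  org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dmapred.child.java.opts=-Xmx4000m \
  --input /path/to/triples \
  --output /path/to/factorization \
  --implicitFeedback true \
  --alpha 40 \
  --numFeatures 20 \
  --numIterations 10 \
  --numThreadsPerSolver 16
```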
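[Editor's note] The confidence formula Sebastian quotes from
"Collaborative Filtering for Implicit Feedback Datasets" can be
illustrated numerically. This is a minimal sketch, not Mahout's actual
code, and the play counts below are made up; it shows why alpha=40
produces much larger confidence weights than alpha=1 for the same raw
counts, which is the trade-off Sean describes.

```python
# Confidence weighting from the implicit-feedback ALS paper:
#   confidence = 1 + alpha * observed_value
# Larger alpha amplifies the gap between observed and unobserved entries.

def confidence(observed_value, alpha):
    """Linear confidence in an implicit-feedback observation."""
    return 1.0 + alpha * observed_value

play_counts = [0, 1, 5, 40]  # hypothetical song play counts

for alpha in (1.0, 40.0):
    weights = [confidence(r, alpha) for r in play_counts]
    print(f"alpha={alpha}: {weights}")
```

Note that an unobserved entry (count 0) always gets confidence 1, so
with alpha=1 a single play only doubles the weight, while with alpha=40
it dominates it.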
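[Editor's note] On JU's question about computing recommendations only
for the user IDs listed in a file: the thread leaves this unresolved,
but the underlying idea is just to load the ID file into a set and skip
every other user before doing any per-user work. A language-agnostic
sketch (all function names here are hypothetical, not Mahout APIs); in
the actual Hadoop job the file would typically be loaded in the
mapper's setup and the filter applied in map():

```python
# Sketch: restrict recommendation computation to a whitelist of user IDs.
import io

def load_user_whitelist(fileobj):
    """Read one user ID per line into a set for O(1) membership tests."""
    return {int(line.strip()) for line in fileobj if line.strip()}

def recommend_for_users(all_users, whitelist, recommend_fn):
    """Invoke the (expensive) recommender only for whitelisted users."""
    return {u: recommend_fn(u) for u in all_users if u in whitelist}

# toy demo with an in-memory "file" and a fake recommender
ids_file = io.StringIO("3\n7\n")
whitelist = load_user_whitelist(ids_file)
recs = recommend_for_users([1, 3, 5, 7], whitelist, lambda u: [u * 10])
print(recs)  # only users 3 and 7 get recommendations
```

The point is that filtering happens before the top-500 scoring, so the
cost scales with the whitelist size rather than the full user base.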
