Thanks again Sebastian and Sean. I set -Xmx4000m for mapred.child.java.opts and 8 threads for each mapper. Now the job runs smoothly and the whole factorization finishes in about 45 minutes. With your settings I think it should be even faster.
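For reference, this is roughly how a run with those settings can be invoked. The -D properties are the ones from this thread; the input/output paths and the numFeatures/numIterations/lambda/alpha values are placeholders, so check the flags against your own Mahout trunk build:

```shell
# One mapper per machine, 4 GB child heap, JVM reuse, 8 solver threads.
# Paths and model parameters below are illustrative placeholders.
mahout parallelALS \
  -Dmapred.child.java.opts=-Xmx4000m \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  --input /path/to/triples \
  --output /path/to/factorization \
  --numFeatures 20 \
  --numIterations 10 \
  --lambda 0.065 \
  --implicitFeedback true \
  --alpha 40 \
  --numThreadsPerSolver 8
```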
One more thing: RecommendJob is quite slow when it computes recommendations for all users. For example, I want a list of the top 500 items to recommend. Any pointers on how to modify the job code so that it consults a file and computes recommendations only for the user IDs in that file?

2013/3/20 Han JU <[email protected]>

> Hi Sebastian,
>
> I've tried the svn trunk. Hadoop constantly complains about memory with
> "out of memory" errors.
> The datanode has 4 physical cores (16 logical cores with hyper-threading),
> so I set --numThreadsPerSolver to 16, and that seems to cause the memory
> problem.
> How do you set your mapred.child.java.opts? Given that we allow only one
> mapper, it should be nearly the whole size of system memory?
>
> Thanks!
>
>
> 2013/3/19 Sebastian Schelter <[email protected]>
>
>> Hi JU,
>>
>> We recently rewrote the factorization code; it should be much faster
>> now. You should use the current trunk, make Hadoop schedule only one
>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1),
>> make it reuse the JVMs, and add the parameter --numThreadsPerSolver
>> with the number of cores that you want to use per machine (use all if
>> you can).
>>
>> I got astonishing results running the code like this on a 26-machine
>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>> dataset (700M datapoints).
>>
>> Let me know if you need more information.
>>
>> Best,
>> Sebastian
>>
>> On 19.03.2013 15:31, Han JU wrote:
>> > Thanks Sebastian and Sean, I will dig more into the paper.
>> > From a quick try on a small part of the data, it seems a larger
>> > alpha (~40) gets me a better result.
>> > Do you have an idea how long ParallelALS will take for the 700 MB
>> > complete dataset? It contains ~48 million triples. The Hadoop
>> > cluster at my disposal has 5 nodes and can factorize MovieLens 10M
>> > in about 13 min.
>> >
>> >
>> > 2013/3/18 Sebastian Schelter <[email protected]>
>> >
>> >> You should also be aware that the alpha parameter comes from a
>> >> formula the authors introduce to measure the "confidence" in the
>> >> observed values:
>> >>
>> >> confidence = 1 + alpha * observed_value
>> >>
>> >> You can also change that formula in the code to something that you
>> >> see more fit; the paper even suggests alternative variants.
>> >>
>> >> Best,
>> >> Sebastian
>> >>
>> >> On 18.03.2013 18:06, Han JU wrote:
>> >>> Thanks for the quick responses.
>> >>>
>> >>> Yes, it's that dataset. What I'm using is triples of "user_id
>> >>> song_id play_times" for ~1M users. No audio features, just plain
>> >>> text triples.
>> >>>
>> >>> It seems to me that the "implicit feedback" paper matches this
>> >>> dataset well: no explicit ratings, but counts of listens per song.
>> >>>
>> >>> Thank you Sean for the alpha value. I think they use big numbers
>> >>> because the values in their R matrix are big.
>> >>>
>> >>>
>> >>> 2013/3/18 Sebastian Schelter <[email protected]>
>> >>>
>> >>>> JU,
>> >>>>
>> >>>> are you referring to this dataset?
>> >>>>
>> >>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>> >>>>
>> >>>> On 18.03.2013 17:47, Sean Owen wrote:
>> >>>>> One word of caution: there are at least two papers on ALS, and
>> >>>>> they define lambda differently. I think you are talking about
>> >>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>> >>>>>
>> >>>>> I've been working with some folks who point out that alpha=40
>> >>>>> seems to be too high for most data sets. After running some
>> >>>>> tests on common data sets, alpha=1 looks much better. YMMV.
>> >>>>>
>> >>>>> In the end you have to evaluate these two parameters, and the
>> >>>>> number of features, across a range to determine what's best.
>> >>>>>
>> >>>>> Is this data set not a bunch of audio features?
>> >>>>> I am not sure it works for ALS, not naturally at least.
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'm wondering, has someone tried the ParallelALS implicit
>> >>>>>> feedback job on the Million Song Dataset? Some pointers on
>> >>>>>> alpha and lambda?
>> >>>>>>
>> >>>>>> In the paper alpha is 40 and lambda is 150, but I don't know
>> >>>>>> what their r values in the matrix are. They say it is based on
>> >>>>>> the time units that users have watched a show, so it may be
>> >>>>>> big.
>> >>>>>>
>> >>>>>> Many thanks!
>> >>>>>> --
>> >>>>>> *JU Han*
>> >>>>>>
>> >>>>>> UTC - Université de Technologie de Compiègne
>> >>>>>> *GI06 - Fouille de Données et Décisionnel*
>> >>>>>>
>> >>>>>> +33 0619608888
>
> --
> *JU Han*
>
> Software Engineer Intern @ KXEN Inc.
> UTC - Université de Technologie de Compiègne
> *GI06 - Fouille de Données et Décisionnel*
>
> +33 0619608888

--
*JU Han*

Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
*GI06 - Fouille de Données et Décisionnel*

+33 0619608888
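P.S. On the RecommendJob question at the top of this message, the kind of change I have in mind is: load a whitelist of user IDs (one per line, e.g. from the DistributedCache) in the mapper's setup(), and skip every user not in the set. This is a hypothetical standalone sketch, not Mahout code; the class and method names are mine:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of restricting recommendation computation to a
 * whitelist of user IDs read from a side file. In a real Hadoop mapper,
 * the set would be filled in setup() from a file shipped via the
 * DistributedCache, and map() would emit nothing for other users.
 */
public class UserFilterSketch {

  private final Set<Long> wantedUsers;

  public UserFilterSketch(Set<Long> wantedUsers) {
    this.wantedUsers = wantedUsers;
  }

  /** Parse one line of the whitelist file (one numeric user ID per line). */
  static long parseUserId(String line) {
    return Long.parseLong(line.trim());
  }

  /** The check a mapper would perform before computing recommendations. */
  public boolean shouldRecommendFor(long userId) {
    return wantedUsers.contains(userId);
  }

  public static void main(String[] args) {
    // Stand-in for the lines of the whitelist file.
    Set<Long> wanted = new HashSet<>();
    for (String line : Arrays.asList("42", "7", "10012")) {
      wanted.add(parseUserId(line));
    }
    UserFilterSketch filter = new UserFilterSketch(wanted);
    System.out.println(filter.shouldRecommendFor(42));  // user is in the file
    System.out.println(filter.shouldRecommendFor(99));  // user is not, so skip
  }
}
```

(The non-distributed item-based RecommenderJob has a --usersFile option for exactly this; I'm not sure the ALS job does, hence the sketch.)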
