Hi JU,

the job creates an OpenIntObjectHashMap<Vector> holding the feature vectors as DenseVectors. In one map-job it is filled with the user-feature vectors, in the next one with the item-feature vectors.
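For illustration, the in-memory layout is roughly the following. This is a minimal sketch that uses java.util.HashMap<Integer, double[]> as a stand-in for Mahout's OpenIntObjectHashMap<Vector> and DenseVector, so it runs without Mahout on the classpath; the class and variable names are mine, not from the job itself:

```java
import java.util.HashMap;
import java.util.Map;

public class FeatureVectorCache {

    public static void main(String[] args) {
        // One such map is filled per map-job: first with the
        // user-feature vectors, then with the item-feature vectors.
        Map<Integer, double[]> userFeatures = new HashMap<>();

        int numFeatures = 20; // e.g. 20 features, as in the 1.8M-user run
        int userId = 42;      // hypothetical id for illustration

        // A DenseVector is essentially a double[] of length numFeatures.
        userFeatures.put(userId, new double[numFeatures]);

        System.out.println(userFeatures.get(userId).length); // 20
    }
}
```

The point is that the whole factor matrix for one side sits in the mapper's heap, which is why the -Xmx setting below has to be sized to the number of users/items times the number of features.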
I used 4 gigabytes for a dataset with 1.8M users (using 20 features), so I guess that 2-3 GB should be enough for your dataset. I used these settings:

mapred.job.reuse.jvm.num.tasks=-1
mapred.tasktracker.map.tasks.maximum=1
mapred.child.java.opts=-Xmx4096m

On 20.03.2013 10:01, Han JU wrote:
> Hi Sebastian,
>
> I've tried the svn trunk. Hadoop constantly complains about memory with
> "out of memory" errors.
> The datanode has 4 physical cores and by hyper-threading 16 logical
> cores, so I set --numThreadsPerSolver to 16, and that seems to cause
> a problem with memory.
> How did you set your mapred.child.java.opts? Given that we allow only one
> mapper, should that be nearly the whole size of system memory?
>
> Thanks!
>
>
> 2013/3/19 Sebastian Schelter <[email protected]>
>
>> Hi JU,
>>
>> We recently rewrote the factorization code; it should be much faster
>> now. You should use the current trunk, make Hadoop schedule only one
>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>> number of cores that you want to use per machine (use all if you can).
>>
>> I got astonishing results running the code like this on a 26-machine
>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>> dataset (700M datapoints).
>>
>> Let me know if you need more information.
>>
>> Best,
>> Sebastian
>>
>> On 19.03.2013 15:31, Han JU wrote:
>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>> With a simple try on a small part of the data, it seems a larger alpha
>>> (~40) gets me a better result.
>>> Do you have an idea how long ParallelALS will take for the 700 MB
>>> complete dataset? It contains ~48 million triples. The Hadoop cluster
>>> at my disposal has 5 nodes and can factorize the MovieLens 10M in
>>> about 13 min.
>>>
>>>
>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>
>>>> You should also be aware that the alpha parameter comes from a formula
>>>> the authors introduce to measure the "confidence" in the observed
>>>> values:
>>>>
>>>> confidence = 1 + alpha * observed_value
>>>>
>>>> You can also change that formula in the code to something you see as a
>>>> better fit; the paper even suggests alternative variants.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>> Thanks for the quick responses.
>>>>>
>>>>> Yes, it's that dataset. What I'm using is triples of "user_id song_id
>>>>> play_times", from ~1M users. No audio features, just plain-text
>>>>> triples.
>>>>>
>>>>> It seems to me that the paper about "implicit feedback" matches this
>>>>> dataset well: no explicit ratings, but the number of times a song was
>>>>> played.
>>>>>
>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>> because the values in their R matrix are big.
>>>>>
>>>>>
>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>
>>>>>> JU,
>>>>>>
>>>>>> are you referring to this dataset?
>>>>>>
>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>
>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>> One word of caution: there are at least two papers on ALS, and they
>>>>>>> define lambda differently. I think you are talking about
>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>
>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>> to be too high for most datasets. After running some tests on common
>>>>>>> datasets, alpha=1 looks much better. YMMV.
>>>>>>>
>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>> features, across a range to determine what's best.
>>>>>>>
>>>>>>> Is this dataset not a bunch of audio features? I am not sure it
>>>>>>> works for ALS, not naturally at least.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm wondering whether someone has tried the ParallelALS
>>>>>>>> implicit-feedback job on the Million Song Dataset? Any pointers on
>>>>>>>> alpha and lambda?
>>>>>>>>
>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>> the r values in their matrix are. They said they are based on the
>>>>>>>> time units users have watched a show, so they may be big.
>>>>>>>>
>>>>>>>> Many thanks!
>>>>>>>> --
>>>>>>>> *JU Han*
>>>>>>>>
>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>>>
>>>>>>>> +33 0619608888
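The confidence formula quoted above can be sketched in a few lines. This is a toy illustration (alpha=40 as in the paper, play counts as the observed values), not the Mahout implementation itself:

```java
public class ConfidenceDemo {

    // Confidence in an observation, per the implicit-feedback ALS paper:
    // c_ui = 1 + alpha * r_ui
    static double confidence(double alpha, double observed) {
        return 1.0 + alpha * observed;
    }

    public static void main(String[] args) {
        double alpha = 40.0;
        // A song played 3 times: confidence = 1 + 40 * 3
        System.out.println(confidence(alpha, 3.0)); // 121.0
        // An unobserved (user, item) pair keeps the baseline confidence
        System.out.println(confidence(alpha, 0.0)); // 1.0
    }
}
```

This makes Sean's point concrete: with alpha=40, a single observation already outweighs the baseline by a factor of 40, which is why alpha=1 can behave much better on datasets with small r values.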
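Pulling the settings and flags from the thread together, a job invocation might look roughly like this. This is a hedged sketch: the jar name, paths, and parameter values are illustrative, and the exact option set depends on the Mahout trunk version you build:

```shell
# Illustrative only: adjust jar/class/paths to your Mahout build.
hadoop jar mahout-core-job.jar \
    org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
    -Dmapred.job.reuse.jvm.num.tasks=-1 \
    -Dmapred.tasktracker.map.tasks.maximum=1 \
    -Dmapred.child.java.opts=-Xmx4096m \
    --input /path/to/triples \
    --output /path/to/factorization \
    --numFeatures 20 \
    --numIterations 10 \
    --implicitFeedback true \
    --alpha 40 \
    --lambda 0.1 \
    --numThreadsPerSolver 4
```

Note that --numThreadsPerSolver should match the physical cores of one machine (per the advice above, 16 threads on 4 physical cores was too many), and -Xmx must leave room for the full user- or item-factor map in each reused JVM.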
