Thanks Sebastian and Sean, I will dig more into the paper. From a quick try on a small part of the data, a larger alpha (~40) seems to give me better results. Do you have an idea how long ParallelALS will take on the complete 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I have available has 5 nodes and can factorize MovieLens 10M in about 13 min.
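To make the effect of alpha concrete, here is a minimal sketch in Python of the confidence formula Sebastian quotes below (confidence = 1 + alpha * observed_value). The function name `confidence` and the sample play counts are made up for illustration; this is not Mahout's code, just the arithmetic.

```python
def confidence(observed_value, alpha):
    """Confidence weight from the quoted formula: 1 + alpha * observed_value."""
    return 1.0 + alpha * observed_value


# Hypothetical play counts for one user (illustrative values, not real data).
play_counts = [0, 1, 5, 20]

# Compare the two alpha values discussed in the thread.
for alpha in (1, 40):
    weights = [confidence(r, alpha) for r in play_counts]
    print(f"alpha={alpha}: {weights}")
```

With alpha=40 a song played 5 times gets a weight of 201 against 1 for an unplayed song, while alpha=1 gives 6 against 1, so the right alpha probably depends a lot on how large the play counts are.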
2013/3/18 Sebastian Schelter <[email protected]>

> You should also be aware that the alpha parameter comes from a formula
> the authors introduce to measure the "confidence" in the observed values:
>
> confidence = 1 + alpha * observed_value
>
> You can also change that formula in the code to something that you see
> more fit, the paper even suggests alternative variants.
>
> Best,
> Sebastian
>
>
> On 18.03.2013 18:06, Han JU wrote:
> > Thanks for quick responses.
> >
> > Yes it's that dataset. What I'm using is triplets of "user_id song_id
> > play_times", of ~ 1m users. No audio things, just plain text triples.
> >
> > It seems to me that the paper about "implicit feedback" matches well this
> > dataset: no explicit ratings, but times of listening to a song.
> >
> > Thank you Sean for the alpha value, I think they use big numbers is because
> > their values in the R matrix is big.
> >
> >
> > 2013/3/18 Sebastian Schelter <[email protected]>
> >
> >> JU,
> >>
> >> are you referring to this dataset?
> >>
> >> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>
> >> On 18.03.2013 17:47, Sean Owen wrote:
> >>> One word of caution, is that there are at least two papers on ALS and they
> >>> define lambda differently. I think you are talking about "Collaborative
> >>> Filtering for Implicit Feedback Datasets".
> >>>
> >>> I've been working with some folks who point out that alpha=40 seems to be
> >>> too high for most data sets. After running some tests on common data sets,
> >>> alpha=1 looks much better. YMMV.
> >>>
> >>> In the end you have to evaluate these two parameters, and the # of
> >>> features, across a range to determine what's best.
> >>>
> >>> Is this data set not a bunch of audio features? I am not sure it works for
> >>> ALS, not naturally at least.
> >>>
> >>>
> >>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm wondering has someone tried ParallelALS with implicit feedback job on
> >>>> million song dataset? Some pointers on alpha and lambda?
> >>>>
> >>>> In the paper alpha is 40 and lambda is 150, but I don't know what are their
> >>>> r values in the matrix. They said is based on time units that users have
> >>>> watched the show, so may be it's big.
> >>>>
> >>>> Many thanks!
> >>>> --
> >>>> *JU Han*
> >>>>
> >>>> UTC - Université de Technologie de Compiègne
> >>>> *GI06 - Fouille de Données et Décisionnel*
> >>>>
> >>>> +33 0619608888
> >>>>
> >>>
> >>
> >>
> >
> >
>

--
*JU Han*

Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
*GI06 - Fouille de Données et Décisionnel*

+33 0619608888
