Hi JU,

the job creates an OpenIntObjectHashMap<Vector> holding the feature vectors as DenseVectors. In one map-job it is filled with the user-feature vectors, in the next one with the item-feature vectors.
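For illustration, the in-memory layout is roughly the following. This is a minimal sketch that uses java.util.HashMap<Integer, double[]> as a stand-in for Mahout's OpenIntObjectHashMap<Vector> and DenseVector, so it runs without Mahout on the classpath; the class and variable names are mine, not from the job itself:

```java
import java.util.HashMap;
import java.util.Map;

public class FeatureVectorCache {

    public static void main(String[] args) {
        // One such map is filled per map-job: first with the
        // user-feature vectors, then with the item-feature vectors.
        Map<Integer, double[]> userFeatures = new HashMap<>();

        int numFeatures = 20; // e.g. 20 features, as in the 1.8M-user run
        int userId = 42;      // hypothetical id for illustration

        // A DenseVector is essentially a double[] of length numFeatures.
        userFeatures.put(userId, new double[numFeatures]);

        System.out.println(userFeatures.get(userId).length); // 20
    }
}
```

The point is that the whole factor matrix for one side sits in the mapper's heap, which is why the -Xmx setting below has to be sized to the number of users/items times the number of features.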
I used 4 gigabytes for a dataset with 1.8M users (using 20 features), so I guess that 2-3 GB should be enough for your dataset. I used these settings:

mapred.job.reuse.jvm.num.tasks=-1
mapred.tasktracker.map.tasks.maximum=1
mapred.child.java.opts=-Xmx4096m

On 20.03.2013 10:01, Han JU wrote:
> Hi Sebastian,
>
> I've tried the svn trunk. Hadoop constantly complains about memory with
> "out of memory" errors.
> The datanode has 4 physical cores and by hyper-threading 16 logical
> cores, so I set --numThreadsPerSolver to 16, and that seems to cause
> a problem with memory.
> How did you set your mapred.child.java.opts? Given that we allow only one
> mapper, should that be nearly the whole size of system memory?
>
> Thanks!
>
>
> 2013/3/19 Sebastian Schelter <[email protected]>
>
>> Hi JU,
>>
>> We recently rewrote the factorization code; it should be much faster
>> now. You should use the current trunk, make Hadoop schedule only one
>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>> number of cores that you want to use per machine (use all if you can).
>>
>> I got astonishing results running the code like this on a 26-machine
>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>> dataset (700M datapoints).
>>
>> Let me know if you need more information.
>>
>> Best,
>> Sebastian
>>
>> On 19.03.2013 15:31, Han JU wrote:
>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>> With a simple try on a small part of the data, it seems a larger alpha
>>> (~40) gets me a better result.
>>> Do you have an idea how long ParallelALS will take for the 700 MB
>>> complete dataset? It contains ~48 million triples. The Hadoop cluster
>>> at my disposal has 5 nodes and can factorize the MovieLens 10M in
>>> about 13 min.
>>>
>>>
>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>
>>>> You should also be aware that the alpha parameter comes from a formula
>>>> the authors introduce to measure the "confidence" in the observed
>>>> values:
>>>>
>>>> confidence = 1 + alpha * observed_value
>>>>
>>>> You can also change that formula in the code to something you see as a
>>>> better fit; the paper even suggests alternative variants.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>> Thanks for the quick responses.
>>>>>
>>>>> Yes, it's that dataset. What I'm using is triples of "user_id song_id
>>>>> play_times", from ~1M users. No audio features, just plain-text
>>>>> triples.
>>>>>
>>>>> It seems to me that the paper about "implicit feedback" matches this
>>>>> dataset well: no explicit ratings, but the number of times a song was
>>>>> played.
>>>>>
>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>> because the values in their R matrix are big.
>>>>>
>>>>>
>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>
>>>>>> JU,
>>>>>>
>>>>>> are you referring to this dataset?
>>>>>>
>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>
>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>> One word of caution: there are at least two papers on ALS, and they
>>>>>>> define lambda differently. I think you are talking about
>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>
>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>> to be too high for most datasets. After running some tests on common
>>>>>>> datasets, alpha=1 looks much better. YMMV.
>>>>>>>
>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>> features, across a range to determine what's best.
>>>>>>>
>>>>>>> Is this dataset not a bunch of audio features? I am not sure it
>>>>>>> works for ALS, not naturally at least.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm wondering whether someone has tried the ParallelALS
>>>>>>>> implicit-feedback job on the Million Song Dataset? Any pointers on
>>>>>>>> alpha and lambda?
>>>>>>>>
>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>> the r values in their matrix are. They said they are based on the
>>>>>>>> time units users have watched a show, so they may be big.
>>>>>>>>
>>>>>>>> Many thanks!
>>>>>>>> --
>>>>>>>> *JU Han*
>>>>>>>>
>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>>>
>>>>>>>> +33 0619608888
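The confidence formula quoted above can be sketched in a few lines. This is a toy illustration (alpha=40 as in the paper, play counts as the observed values), not the Mahout implementation itself:

```java
public class ConfidenceDemo {

    // Confidence in an observation, per the implicit-feedback ALS paper:
    // c_ui = 1 + alpha * r_ui
    static double confidence(double alpha, double observed) {
        return 1.0 + alpha * observed;
    }

    public static void main(String[] args) {
        double alpha = 40.0;
        // A song played 3 times: confidence = 1 + 40 * 3
        System.out.println(confidence(alpha, 3.0)); // 121.0
        // An unobserved (user, item) pair keeps the baseline confidence
        System.out.println(confidence(alpha, 0.0)); // 1.0
    }
}
```

This makes Sean's point concrete: with alpha=40, a single observation already outweighs the baseline by a factor of 40, which is why alpha=1 can behave much better on datasets with small r values.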
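Pulling the settings and flags from the thread together, a job invocation might look roughly like this. This is a hedged sketch: the jar name, paths, and parameter values are illustrative, and the exact option set depends on the Mahout trunk version you build:

```shell
# Illustrative only: adjust jar/class/paths to your Mahout build.
hadoop jar mahout-core-job.jar \
    org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
    -Dmapred.job.reuse.jvm.num.tasks=-1 \
    -Dmapred.tasktracker.map.tasks.maximum=1 \
    -Dmapred.child.java.opts=-Xmx4096m \
    --input /path/to/triples \
    --output /path/to/factorization \
    --numFeatures 20 \
    --numIterations 10 \
    --implicitFeedback true \
    --alpha 40 \
    --lambda 0.1 \
    --numThreadsPerSolver 4
```

Note that --numThreadsPerSolver should match the physical cores of one machine (per the advice above, 16 threads on 4 physical cores was too many), and -Xmx must leave room for the full user- or item-factor map in each reused JVM.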
