RE: A bunch of SVD questions...

Razon, Oren Fri, 06 Jul 2012 13:40:16 -0700

Thanks, Sean.
I accidentally continued this discussion under the thread you opened, so I'm
moving back to my thread :)


I will rephrase the question I asked there.
Let's say that, as part of my held-out test, my model finds that user u2's
connection to i1 has a strength of 28.94, to i2 17.9, and to i3 4.5.
The rating itself, which I have (hidden), is on a scale of 1-5 (or even binary
0/1, for example).

Now, how can I estimate the hidden rating for u2 if I only predicted the
connection strength he has with each item in order to rank the items, while my
data is on a different scale?
In other words, the problem definition here is not prediction but ranking;
therefore I guess it should use different measures than prediction measures...

Am I missing something?

I'm familiar with precision / recall / ROC / lift and so on, but I'm not sure I
understand how I should use them here.
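
One way to read this, as a minimal sketch: ranking measures such as precision@k and recall@k use only the order of the predicted strengths, so the scale mismatch does not matter. Only the three strengths come from the example above; the hidden ratings for u2, the relevance threshold of 4, and the function names are made-up assumptions for illustration.

```python
def top_k(scores, k):
    """Return the k item ids with the highest predicted strength."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def precision_recall_at_k(scores, hidden_ratings, k, threshold):
    """Precision@k and recall@k, treating hidden ratings >= threshold as relevant."""
    relevant = {item for item, r in hidden_ratings.items() if r >= threshold}
    recommended = top_k(scores, k)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Predicted strengths for u2 from the example; any order-preserving rescaling
# of these numbers gives the same ranking and therefore the same metrics.
scores = {"i1": 28.94, "i2": 17.9, "i3": 4.5}
# Hypothetical held-out ratings on the 1-5 scale (not from the thread).
hidden = {"i1": 5, "i2": 2, "i3": 4}

precision, recall = precision_recall_at_k(scores, hidden, k=2, threshold=4)
```

With binary 0/1 data the same functions apply with threshold=1.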

-----Original Message-----
From: Sean Owen [mailto:[email protected]] 
Sent: Thursday, July 05, 2012 15:59
To: [email protected]
Subject: Re: A bunch of SVD questions...

Unless you are recommending users to items too, you don't have a cold
start problem for items. If you are, you can apply the same technique.
Using fold-in, you can create a reasonable user or item vector from
the time you have the very first interaction for the user or item,
which solves most of the cold start problem without resorting to
simple top-10 lists.
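
A minimal sketch of one way fold-in could work, assuming explicit ratings and the regularized least-squares update that ALS uses when the item factors are held fixed. The function names, the regularization value, and the tiny factor matrix are hypothetical; this is not Mahout's API.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in(item_factors, ratings, lam):
    """Estimate a new user's vector x from (Y^T Y + lam * I) x = Y^T r,
    where Y holds the factor vectors of the items the user interacted with."""
    k = len(next(iter(item_factors.values())))
    a = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for item, r in ratings.items():
        y = item_factors[item]
        for i in range(k):
            b[i] += r * y[i]
            for j in range(k):
                a[i][j] += y[i] * y[j]
    return solve(a, b)


# Hypothetical 2-dimensional item factors and a single first interaction.
item_factors = {"i1": [1.0, 0.0], "i2": [0.0, 1.0]}
user_vector = fold_in(item_factors, {"i1": 4.0}, lam=1.0)
```

The regularization term keeps the system solvable even with a single interaction, which is why a usable vector is available from the very first rating. Folding in a new item works the same way with the roles of users and items swapped.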

You can certainly compute user-user and item-item similarity on the
factored matrices. It's a good approximation and is faster. Cosine
measure works fine in this space.
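
As a rough illustration of item-item similarity in the factored space, the sketch below computes cosine similarity between item factor vectors. The vectors and the helper names are hypothetical; in practice the vectors would be the rows of the item feature matrix produced by the factorization.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(item, item_factors, n):
    """Return the n items closest to `item` as (item_id, similarity) pairs."""
    target = item_factors[item]
    others = [(other, cosine(target, vec))
              for other, vec in item_factors.items() if other != item]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical 2-dimensional item factors.
item_factors = {"i1": [1.0, 0.0], "i2": [2.0, 0.0], "i3": [0.0, 3.0]}
neighbours = most_similar("i1", item_factors, n=2)
```

Each comparison costs time proportional to the number of factors rather than the number of users, which is where the speed-up over similarity on the raw matrix comes from.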

Look at finding someone's bootleg copy of the Netflix data set, or the
KDD cup data set. I am using StackOverflow and Wikipedia dumps as good
sources of a big data set though you need to massage it to get it into
a usable form.

Sean

On Thu, Jul 5, 2012 at 3:22 PM, Razon, Oren <[email protected]> wrote:
> Thanks.
> I had some other questions in mind so I will use this post...
>
> 1. Cold start for items - I can handle the user cold-start problem by
> trying items on the new user based on popularity or at random.
> But what options do I have when using ALS / a co-occurrence matrix to
> overcome cold start for items?
>
> 2. What about applying a matrix factorization technique (ALS / SVD) as a
> preprocessing step?
> Meaning, after doing the factorization, use the new lower-dimensional item
> matrix, for example, to compute similarity between items. Would that be a
> good idea?
>
> 3. I'm looking for a huge data set to try my recommender on. I'm searching for
> something even bigger than last.fm / libimseti. Can anyone recommend
> such a dataset?
>
> Thanks,
> Oren
>
>
> -----Original Message-----
> From: Sebastian Schelter [mailto:[email protected]]
> Sent: Thursday, July 05, 2012 12:46
> To: [email protected]
> Subject: Re: A bunch of SVD questions...
>
> There is only one implementation, because both 'flavors' of ALS have the
> same computation shape. The default mode is to factorize explicit
> feedback data, and if you specify the option '--implicitFeedback', it
> will switch to the algorithm that works on implicit feedback data.
> Internally, different solvers from org.apache.mahout.math.als are used,
> if you want to have a deeper look.
>
> Best,
> Sebastian
>
> On 05.07.2012 10:38, Razon, Oren wrote:
>> Thanks for the answer Sebastian!
>> You said Mahout has two 'flavors' of the ALS factorization, one for implicit
>> feedback and the other for explicit.
>> Can you point me to which code does what?
>> In the Hadoop part I can see only one ALS implementation...
>>
>> -----Original Message-----
>> From: Sebastian Schelter [mailto:[email protected]]
>> Sent: Thursday, July 05, 2012 11:12
>> To: [email protected]
>> Subject: Re: A bunch of SVD questions...
>>
>> 1. You can use org.apache.mahout.cf.taste.hadoop.als.RecommenderJob to
>> compute top-N recommendations from the factorization in batch. For
>> each user, you have to compute the product of the item feature matrix
>> and his feature vector and pick the highest ranking unknown items
>> after that.
>>
>> 2. The semantics of the empty cells depends on the type of data you
>> have. For explicit feedback (ratings), you cannot fill the empty cells
>> because you simply don't know what rating the user would have given.
>> For implicit feedback, a cell usually holds the count of some observed
>> behavior, such as clicks. Here empty cells are by definition 0 (no
>> clicks observed); however, the factorization has to be modified to give
>> 'lower confidence' to these data points.
>>
>> 3. There are two 'flavors' of the ALS factorization implemented in
>> Mahout, one for implicit feedback data, the other for explicit
>> feedback data. I suggest you look into the papers they are based on:
>>
>> "Large-scale Parallel Collaborative Filtering for the Netflix Prize"
>> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
>> "Collaborative Filtering for Implicit Feedback Datasets"
>> http://research.yahoo.com/pub/2433
>>
>> I also uploaded the slides from a lecture I gave in a scalable data
>> mining class at our department. They might also be helpful in
>> understanding the topic:
>>
>> http://www.slideshare.net/sscdotopen/latent-factor-models-for-collaborative-filtering
>>
>> Best,
>> Sebastian
>> 2012/7/4 Razon, Oren <[email protected]>:
>>> Hi,
>>> I'm exploring Mahout's parallel SVD implementation over Hadoop (ALS), and I
>>> would like to clarify a few things:
>>> 1.      How do you recommend the top K items with this job? Does the job
>>> factorize the rating matrix, then compute a predicted rating for each
>>> cell in the matrix, so that when you need a recommendation you only need to
>>> retrieve the top K items according to the predicted value for the user? Or
>>> does it factorize the matrix and require some online logic when the
>>> recommendation is requested?
>>> 2.      To my knowledge, applying an SVD technique requires first filling
>>> in all empty cells in the rating matrix (with the average rating, for
>>> example). Is that done during the ALS job (and if so, how are the cells
>>> filled), or should it be done as a preprocessing step?
>>> 3.      From my understanding, SVD recommenders are used to predict a user's
>>> implicit preference. By doing so you can recommend the top K items (the
>>> first K items in descending order of prediction). I wonder, could it be
>>> applied to a binary dataset (explicit), where my rating matrix contains
>>> only 1/0?
>>> 4.      From some reading I found that timeSVD++, developed by
>>> Yehuda Koren, is considered the superior SVD implementation for SVD
>>> recommenders. Is there any kind of parallel implementation
>>> of it on top of Hadoop? I found this proposal:
>>> https://issues.apache.org/jira/browse/MAHOUT-371
>>> What is its status? Has it been reviewed already? Is
>>> it stable? Does anyone have experience with it?
>>>
>>> Thanks,
>>> Oren
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> Intel Electronics Ltd.
>>>
>>> This e-mail and any attachments may contain confidential material for
>>> the sole use of the intended recipient(s). Any review or distribution
>>> by others is strictly prohibited. If you are not the intended
>>> recipient, please contact the sender and delete all copies.
>>
>
>
