Another approach is to say that the distance measures are only interesting close up, and that measurements farther away are dubious. Assign a usefulness factor to each distance, maybe the log of the distance normalized to 0->1. You can now apply fuzzy math algebra.
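A minimal sketch of that weighting, for what it's worth. The inversion (so near distances get weight ~1 and far ones fade toward 0) and the `max_distance` normalizer are my reading of the idea, not anything fixed:

```python
from math import log

def usefulness(distance, max_distance):
    """0->1 usefulness weight for a distance measurement: trustworthy
    close up, dubious far away, fading on a log scale.
    Assumes max_distance > 1."""
    if distance <= 1.0:
        return 1.0  # anything at unit distance or closer is fully trusted
    return max(0.0, 1.0 - log(distance) / log(max_distance))
```

The resulting weights could then feed whatever fuzzy combination you like (e.g. min for AND, max for OR) over the measurements.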
On Thu, Feb 17, 2011 at 1:03 AM, Dinesh B Vadhia <[email protected]> wrote:

>>> The main problem with the likelihood is that it does not take into account
>>> one user disliking and another user liking the same item. This seems to be
>>> more important in dealing with very sparse data. However, I do understand
>>> the motivation, especially given that users more generally rate what they
>>> like and less what they dislike.
>>
>> This is unfortunately really complicated. If you are talking about ratings,
>> then negative ratings tell you more about what somebody likes than about
>> what they dislike. If you are talking about implicit data, then negative
>> ratings are all items with which the user *might* have interacted (i.e.
>> roughly a skazillion things). Mostly they don't interact with these things
>> because something else caught their eye or they are in a bad mood. That
>> doesn't mean much.
>
> It's not really complicated to take into account one user liking and another
> user disliking the same item if an appropriate model is used. The problem
> with CF is that its support for sparsity is poor, particularly for big data.
>
> From: Ted Dunning
> Sent: Wednesday, February 16, 2011 11:33 PM
> To: [email protected]
> Cc: Chris Schilling
> Subject: Re: Sparse data & Item Similarity
>
> On Wed, Feb 16, 2011 at 9:58 PM, Chris Schilling <[email protected]> wrote:
>
>> First. It is apparent that when dealing with sparse data, which most CF
>> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not
>> extremely useful. They do seem to be very useful, however, when dealing
>> with dense vectors/matrices.
>
> Seems about right.
>
>> One question I have regarding the cosine similarity: it seems this is
>> calculated with respect to the intersection of the two vectors. What would
>> happen if we actually divided the dot product by the total magnitudes (i.e.
>> not just the magnitude of the intersection)? Wouldn't that place more
>> weight on the vectors which have more ratings in common?
>
> Cosine is defined as the dot product over the product of the L_2 magnitudes,
> so it is normalized to the -1 to 1 range.
>
> That isn't really the problem. The problem is cases where you have two
> users who rated (interacted with) exactly one item, and that happens to be
> the same item.
>
> You can divide by the product of the L_1 or L_0 norms, but that doesn't
> change the situation much.
>
>> Second: I agree that the likelihood approach (i.e. boolean preferences)
>> helps a lot with sparse data. So, my question is: given a simple
>> log-likelihood log(r/(m+n)), where r is the number of prefs in common and
>> m+n is the total number of prefs in the two vectors, and the Pearson
>> correlation of the intersection, wouldn't the product of these two
>> approximate the true cosine similarity taking into account the ratings?
>
> That isn't log-likelihood.
>
> It is reasonable to use something like (LLR > 10) * pearson as a measure.
> What this does is sparsify the pearson measure to only contain interesting
> values.
>
>> The main problem with the likelihood is that it does not take into account
>> one user disliking and another user liking the same item. This seems to be
>> more important in dealing with very sparse data. However, I do understand
>> the motivation, especially given that users more generally rate what they
>> like and less what they dislike.
>
> This is unfortunately really complicated. If you are talking about ratings,
> then negative ratings tell you more about what somebody likes than about
> what they dislike. If you are talking about implicit data, then negative
> ratings are all items with which the user *might* have interacted (i.e.
> roughly a skazillion things). Mostly they don't interact with these things
> because something else caught their eye or they are in a bad mood. That
> doesn't mean much.
>
>> Just trying to get a more intuitive feel for CF. Hopefully these questions
>> are not way off base...
>
> They are good.

--
Lance Norskog
[email protected]
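For anyone following along, Ted's (LLR > 10) * pearson trick can be sketched in a few lines, using the G^2 log-likelihood ratio over the 2x2 contingency table of co-occurrence counts. The threshold of 10 is the one from his message; the function names and count layout here are just my own illustration:

```python
from math import log

def entropy(*counts):
    """Unnormalized entropy term: sum of k * log(k / N) over the counts."""
    total = sum(counts)
    return sum(k * log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = users with both items, k12 = first item only,
    k21 = second item only, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2 * (mat - row - col)

def sparsified_similarity(k11, k12, k21, k22, pearson, threshold=10.0):
    """(LLR > threshold) * pearson: keep the Pearson value only for
    pairs whose co-occurrence is surprisingly high."""
    return pearson if llr(k11, k12, k21, k22) > threshold else 0.0
```

An independent table like (10, 10, 10, 10) gives an LLR of 0, so the pair is dropped no matter what its Pearson value is; a strongly associated table like (100, 1, 1, 100) clears the threshold easily and keeps its Pearson value.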
