Re: Sparse data & Item Similarity

Chris Schilling Wed, 16 Feb 2011 21:59:34 -0800

So, 

I am currently enthralled by this discussion.  I just have a few questions 
regarding the use of similarity metrics in CF.

First.  It is apparent that when dealing with sparse data, as most CF 
systems seem to do, the Pearson/cosine/Euclidean similarity metrics are not 
very useful.  They do seem to be very useful, however, when dealing with 
dense vectors/matrices.  

One question I have regarding the cosine similarity: it seems this is 
calculated with respect to the intersection of the two vectors.  What would 
happen if we actually divided the dot product by the total magnitudes (i.e. not 
just the magnitude of the intersection)?  Wouldn't that place more weight on 
the vectors which have more ratings in common?
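
To make the question concrete, here is a rough sketch of the two variants. It assumes each user's ratings are a plain dict of item -> rating; the function names and sample data are made up for illustration and are not Mahout's API.

```python
import math

def cosine_intersection(a, b):
    """Cosine where the norms use only the items both users rated."""
    common = a.keys() & b.keys()
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cosine_full(a, b):
    """Cosine where the norms use every rating of each user.

    Items missing from one user contribute 0 to the dot product,
    so only the shared items appear in the numerator.
    """
    common = a.keys() & b.keys()
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical users: u and v share two of v's four items,
# while p and q share only a single item.
u = {"a": 5, "b": 3}
v = {"a": 5, "b": 3, "c": 4, "d": 4}
p = {"a": 5, "c": 1}
q = {"a": 2, "d": 5}

for x, y in ((u, v), (p, q)):
    print(cosine_intersection(x, y), cosine_full(x, y))
```

With positive ratings, a single shared item always gives an intersection cosine of 1, whatever the two ratings are.  Dividing by the full magnitudes pulls that score down, and pulls it down further the more unshared ratings each user has.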

Second: I agree that the likelihood approach (i.e. boolean preferences) helps a 
lot with sparse data.  So, my question is: given a simple log-likelihood 
log(r/(m+n)), where r is the number of prefs in common and m+n is the total number 
of prefs in the two vectors, and the Pearson correlation of the intersection, 
wouldn't the product of these two approximate the true cosine similarity, taking 
the ratings into account?
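
A small sketch of the two quantities in that product, again assuming dicts of item -> rating.  The helper names and the sample data are hypothetical; this only computes the numbers being asked about and does not show that the product approximates the cosine.

```python
import math

def log_overlap(a, b):
    """log(r / (m + n)): r = shared items, m and n = prefs per user."""
    r = len(a.keys() & b.keys())
    if r == 0:
        return float("-inf")  # log(0): no items in common
    return math.log(r / (len(a) + len(b)))

def pearson_intersection(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(a.keys() & b.keys())
    n = len(common)
    if n < 2:
        return 0.0  # undefined for fewer than two shared items
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined when one side has constant ratings
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users sharing three items (r = 3, m + n = 7).
a = {"i1": 5, "i2": 3, "i3": 1}
b = {"i1": 4, "i2": 2, "i3": 1, "i4": 5}

weight = log_overlap(a, b)
corr = pearson_intersection(a, b)
print(weight, corr, weight * corr)
```

One thing the sketch makes visible: r can never exceed the smaller of m and n, so r/(m+n) is at most 1/2 and the log is always negative.  The raw product therefore flips the sign of the Pearson correlation, and something like the unlogged ratio r/(m+n) may behave more like the weight being asked about.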

The main problem with the likelihood is that it does not take into account one 
user disliking and another user liking the same item.  This seems to be more 
important in dealing with very sparse data.  However, I do understand the 
motivation, especially given that users more generally rate what they like and 
less what they dislike. 

Just trying to get a more intuitive feel for CF.  Hopefully these questions are 
not way off base...

Thanks for all the help, great work!
Chris

On Feb 16, 2011, at 9:28 PM, Lance Norskog wrote:

> If I were the business, I would analyze the "put in cart but did not
> buy" list. Negative ratings are just as useful as positive ratings.
> Possibly this gives a +1/-1 ternary value?
> 
> On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <[email protected]> wrote:
>> My experience is that there is a very small number of events that indicates 
>> real engagement. Using them in the form of Boolean preferences helps 
>> results. A lot.
>> 
>> Using all of the other events that do not indicate engagement is a total 
>> waste of resources because you are simply teaching the machine about things 
>> you don't care about.
>> 
>> Moreover, there are probably some kinds of events that vastly outnumber 
>> others. Events that are less than 1% of your data can matter, but often do not.
>> 
>> The valuable secret sauce you will gain is which events are which. Which 
>> make your system sing and which ones just clog up the drains.
>> 
>> Matthew wrote:
>> users can do.. "view", "add to cart", and "buy" which I've assigned
>> different preference values to. Perhaps it would be better to simply
>> use boolean yes/no in my case?
>> 
> 
> 
> 
> -- 
> Lance Norskog
> [email protected]
