So, I am currently enthralled by this discussion. I just have a few questions regarding the use of similarity metrics in CF.
First: it is apparent that with sparse data, which is what most CF systems seem to deal with, the Pearson/cosine/Euclidean similarity metrics are not very useful. They do seem to be very useful, however, with dense vectors/matrices. One question I have regarding cosine similarity: it seems this is calculated with respect to the intersection of the two vectors. What would happen if we instead divided the dot product by the total magnitudes (i.e. not just the magnitudes of the intersection)? Wouldn't that place more weight on the vectors which have more ratings in common?

Second: I agree that the likelihood approach (i.e. boolean preferences) helps a lot with sparse data. So my question is: given a simple log-likelihood log(r/(m+n)), where r is the number of prefs in common and m+n is the total number of prefs in the two vectors, and the Pearson correlation of the intersection, wouldn't the product of these two approximate the true cosine similarity while taking the ratings into account?

The main problem with the likelihood is that it does not take into account one user disliking and another user liking the same item. This seems to be more important when dealing with very sparse data. However, I do understand the motivation, especially given that users generally rate what they like more often than what they dislike.

Just trying to get a more intuitive feel for CF. Hopefully these questions are not way off base...

Thanks for all the help, great work!

Chris

On Feb 16, 2011, at 9:28 PM, Lance Norskog wrote:

> If I was the business, I would analyze the "put in cart but did not
> buy" list. Negative ratings are just as useful as positive ratings.
> Possibly this gives a +1/-1 ternary value?
>
> On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <[email protected]> wrote:
>> My experience is that there is a very small number of events that indicates
>> real engagement. Using them in the form of Boolean preferences helps
>> results. A lot.
>>
>> Using all of the other events that do not indicate engagement is a total
>> waste of resources because you are simply teaching the machine about things
>> you don't care about.
>>
>> Moreover there are probably some kinds of events that vastly outnumber
>> others. Events that are less than 1% of your can matter but often not.
>>
>> The valuable secret sauce you will gain is which events are which. Which
>> make your system sing and which ones just clog up the drains.
>>
>> Matthew wrote:
>> users can do.. "view", "add to cart", and "buy" which I've assigned
>> different preference values to. Perhaps it would be better to simply
>> use boolean yes/no in my case?
>>
>
> --
> Lance Norskog
> [email protected]
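
P.S. To make the first question concrete, here is a minimal Python sketch of the two normalizations being compared: cosine over the co-rated items only, versus the same dot product divided by each user's full magnitude. This is not Mahout code; the function name, the dict-based user vectors, and the toy ratings are all made up for illustration, and it assumes strictly positive ratings so no norm is zero.

```python
import math

def cosine_variants(a, b):
    """Return (intersection_cosine, full_cosine) for two sparse users.

    a, b: dicts mapping item id -> rating (hypothetical representation).
    Assumes ratings are non-zero so the norms below are never zero.
    """
    common = a.keys() & b.keys()
    if not common:
        return 0.0, 0.0

    # Only co-rated items contribute to the dot product in either variant.
    dot = sum(a[i] * b[i] for i in common)

    # Variant 1: magnitudes restricted to the co-rated items.
    norm_a_common = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b_common = math.sqrt(sum(b[i] ** 2 for i in common))

    # Variant 2: magnitudes over everything each user rated.
    norm_a_full = math.sqrt(sum(v * v for v in a.values()))
    norm_b_full = math.sqrt(sum(v * v for v in b.values()))

    return dot / (norm_a_common * norm_b_common), dot / (norm_a_full * norm_b_full)


# Two users who agree exactly on the two items they share,
# but the second user has rated three more items.
u = {"a": 5, "b": 3}
v = {"a": 5, "b": 3, "c": 4, "d": 4, "e": 4}
inter_cos, full_cos = cosine_variants(u, v)
print(inter_cos, full_cos)
```

With these toy vectors the intersection-only score is 1.0, while the full-magnitude score drops to about 0.64, because the three items only one user rated inflate that user's norm without adding to the dot product. So the full-magnitude version does penalize small overlap relative to how much each user has rated, which is the weighting effect I was asking about.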
