>> The main problem with the likelihood is that it does not take into account
>> one user disliking and another user liking the same item. This seems to be
>> more important in dealing with very sparse data. However, I do understand
>> the motivation, especially given that users more generally rate what they
>> like and less what they dislike.
> This is unfortunately really complicated. If you are talking about ratings,
> then negative ratings tell you more about what somebody likes than about
> what they dislike. If you are talking about implicit data, then negative
> ratings are all items with which the user *might* have interacted (i.e.
> roughly a skazillion things). Mostly they don't interact with these things
> because something else caught their eye or they are in a bad mood. That
> doesn't mean much.

It's not really complicated to take into account one user liking and another
user disliking the same item if an appropriate model is used. The problem with
CF is that its support for sparsity is poor, particularly for big data.

From: Ted Dunning
Sent: Wednesday, February 16, 2011 11:33 PM
To: [email protected]
Cc: Chris Schilling
Subject: Re: Sparse data & Item Similarity

On Wed, Feb 16, 2011 at 9:58 PM, Chris Schilling <[email protected]> wrote:

> First. It is apparent that when dealing with sparse data, which most CF
> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not
> extremely useful. They do seem to be very useful, however, when dealing
> with dense vectors/matrices.

Seems about right.

> One question I have regarding the cosine similarity: it seems this is
> calculated with respect to the intersection of the two vectors. What would
> happen if we actually divided the dot product by the total magnitudes (i.e.
> not just the magnitude of the intersection)? Wouldn't that place more
> weight on the vectors which have more ratings in common?

Cosine is defined as the dot product over the product of the L_2 magnitudes,
so it is normalized to the -1 to 1 range. That isn't really the problem. The
problem is cases where you have two users who rated (interacted with) exactly
one item, and that happens to be the same item. You can divide by the product
of the L_1 or L_0 norms, but that doesn't change the situation much.

> Second: I agree that the likelihood approach (i.e. boolean preferences)
> helps a lot with sparse data. So, my question is: given a simple
> log-likelihood log(r/(m+n)), where r is the number of prefs in common and
> m+n is the total number of prefs in the two vectors, and the Pearson
> correlation of the intersection, wouldn't the product of these two
> approximate the true cosine similarity taking into account the ratings?

That isn't log-likelihood. It is reasonable to use something like
(LLR > 10) * pearson as a measure. What this does is sparsify the Pearson
measure to only contain interesting values.

> The main problem with the likelihood is that it does not take into account
> one user disliking and another user liking the same item. This seems to be
> more important in dealing with very sparse data. However, I do understand
> the motivation, especially given that users more generally rate what they
> like and less what they dislike.

This is unfortunately really complicated. If you are talking about ratings,
then negative ratings tell you more about what somebody likes than about
what they dislike. If you are talking about implicit data, then negative
ratings are all items with which the user *might* have interacted (i.e.
roughly a skazillion things). Mostly they don't interact with these things
because something else caught their eye or they are in a bad mood. That
doesn't mean much.

> Just trying to get a more intuitive feel for CF. Hopefully these questions
> are not way off base...

They are good.
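To make the degenerate-cosine case in the thread concrete, here is a minimal
Python sketch (function and item names are my own, purely illustrative): the
dot product is taken over the rating intersection while the norms use each
user's full rating vector, as suggested in the question. Two users who each
rated exactly one item, and it happens to be the same item, still come out at
similarity 1.0 no matter what values they actually rated.

```python
import math

def cosine(u, v):
    """Cosine over the rating intersection; norms use each user's
    full rating vector. u and v are dicts mapping item -> rating."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two users who each rated exactly one item -- the same item --
# look maximally similar regardless of the actual rating values:
print(cosine({'item1': 5.0}, {'item1': 1.0}))  # 1.0

# Using full-vector norms does shrink the score when one user has
# additional ratings outside the intersection, but it cannot rescue
# the single-shared-item case above:
print(cosine({'item1': 5.0, 'item2': 4.0, 'item3': 2.0}, {'item1': 5.0}))
```

This is why the thread concludes that swapping the denominator (L_1, L_0, or
full L_2 norms) does not change the situation much for very sparse users.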
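And here is one way the `(LLR > 10) * pearson` suggestion could be sketched in
Python. This uses the standard G^2 (log-likelihood ratio) statistic on the 2x2
cooccurrence table, which is the usual formulation of Dunning's LLR test; the
threshold 10 comes from the message, while the function names and example
counts are my own.

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # "Raw" (unnormalized) entropy: N*log(N) - sum(k*log(k))
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = users who interacted with both items,
    k12 / k21 = users who interacted with only one of them,
    k22 = users who interacted with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def sparsified_similarity(k11, k12, k21, k22, pearson):
    # Keep the Pearson score only where cooccurrence is "interesting".
    return pearson if llr(k11, k12, k21, k22) > 10.0 else 0.0

# Cooccurrence at exactly the independent rate -> LLR ~ 0 -> suppressed:
print(sparsified_similarity(10, 10, 10, 10, 0.8))  # 0.0
# Strongly anomalous cooccurrence -> large LLR -> Pearson passes through:
print(sparsified_similarity(100, 1, 1, 100, 0.8))  # 0.8
```

The effect is exactly what the reply describes: the Pearson matrix is
sparsified so that only statistically interesting item pairs keep a score,
which is what makes the measure usable on very sparse data.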
