Another approach is to say that the distance measures are only interesting close up, and that measurements farther away are dubious. Assign a usefulness factor to each distance, maybe the log of the distance normalized to 0->1. You can now apply fuzzy math algebra.
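A minimal sketch of that weighting, for what it's worth. The inversion (so near distances get weight ~1 and far ones fade toward 0) and the `max_distance` normalizer are my reading of the idea, not anything fixed:

```python
from math import log

def usefulness(distance, max_distance):
    """0->1 usefulness weight for a distance measurement: trustworthy
    close up, dubious far away, fading on a log scale.
    Assumes max_distance > 1."""
    if distance <= 1.0:
        return 1.0  # anything at unit distance or closer is fully trusted
    return max(0.0, 1.0 - log(distance) / log(max_distance))
```

The resulting weights could then feed whatever fuzzy combination you like (e.g. min for AND, max for OR) over the measurements.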
On Thu, Feb 17, 2011 at 1:03 AM, Dinesh B Vadhia <[email protected]> wrote:

>>> The main problem with the likelihood is that it does not take into account
>>> one user disliking and another user liking the same item. This seems to be
>>> more important in dealing with very sparse data. However, I do understand
>>> the motivation, especially given that users more generally rate what they
>>> like and less what they dislike.
>>
>> This is unfortunately really complicated. If you are talking about ratings,
>> then negative ratings tell you more about what somebody likes than about
>> what they dislike. If you are talking about implicit data, then negative
>> ratings are all items with which the user *might* have interacted (i.e.
>> roughly a skazillion things). Mostly they don't interact with these things
>> because something else caught their eye or they are in a bad mood. That
>> doesn't mean much.
>
> It's not really complicated to take into account one user liking and another
> user disliking the same item if an appropriate model is used. The problem
> with CF is that its support for sparsity is poor, particularly for big data.
>
> From: Ted Dunning
> Sent: Wednesday, February 16, 2011 11:33 PM
> To: [email protected]
> Cc: Chris Schilling
> Subject: Re: Sparse data & Item Similarity
>
> On Wed, Feb 16, 2011 at 9:58 PM, Chris Schilling <[email protected]> wrote:
>
>> First. It is apparent that when dealing with sparse data, which most CF
>> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not
>> extremely useful. They do seem to be very useful, however, when dealing
>> with dense vectors/matrices.
>
> Seems about right.
>
>> One question I have regarding the cosine similarity: it seems this is
>> calculated with respect to the intersection of the two vectors. What would
>> happen if we actually divided the dot product by the total magnitudes (i.e.
>> not just the magnitude of the intersection)? Wouldn't that place more
>> weight on the vectors which have more ratings in common?
>
> Cosine is defined as the dot product over the product of the L_2 magnitudes,
> so it is normalized to the -1 to 1 range.
>
> That isn't really the problem. The problem is cases where you have two
> users who rated (interacted with) exactly one item, and that happens to be
> the same item.
>
> You can divide by the product of the L_1 or L_0 norms, but that doesn't
> change the situation much.
>
>> Second: I agree that the likelihood approach (i.e. boolean preferences)
>> helps a lot with sparse data. So, my question is: given a simple
>> log-likelihood log(r/(m+n)), where r is the number of prefs in common and
>> m+n is the total number of prefs in the two vectors, and the Pearson
>> correlation of the intersection, wouldn't the product of these two
>> approximate the true cosine similarity taking into account the ratings?
>
> That isn't log-likelihood.
>
> It is reasonable to use something like (LLR > 10) * pearson as a measure.
> What this does is sparsify the pearson measure to only contain interesting
> values.
>
>> The main problem with the likelihood is that it does not take into account
>> one user disliking and another user liking the same item. This seems to be
>> more important in dealing with very sparse data. However, I do understand
>> the motivation, especially given that users more generally rate what they
>> like and less what they dislike.
>
> This is unfortunately really complicated. If you are talking about ratings,
> then negative ratings tell you more about what somebody likes than about
> what they dislike. If you are talking about implicit data, then negative
> ratings are all items with which the user *might* have interacted (i.e.
> roughly a skazillion things). Mostly they don't interact with these things
> because something else caught their eye or they are in a bad mood. That
> doesn't mean much.
>
>> Just trying to get a more intuitive feel for CF. Hopefully these questions
>> are not way off base...
>
> They are good.

--
Lance Norskog
[email protected]
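For anyone following along, Ted's (LLR > 10) * pearson trick can be sketched in a few lines, using the G^2 log-likelihood ratio over the 2x2 contingency table of co-occurrence counts. The threshold of 10 is the one from his message; the function names and count layout here are just my own illustration:

```python
from math import log

def entropy(*counts):
    """Unnormalized entropy term: sum of k * log(k / N) over the counts."""
    total = sum(counts)
    return sum(k * log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = users with both items, k12 = first item only,
    k21 = second item only, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2 * (mat - row - col)

def sparsified_similarity(k11, k12, k21, k22, pearson, threshold=10.0):
    """(LLR > threshold) * pearson: keep the Pearson value only for
    pairs whose co-occurrence is surprisingly high."""
    return pearson if llr(k11, k12, k21, k22) > threshold else 0.0
```

An independent table like (10, 10, 10, 10) gives an LLR of 0, so the pair is dropped no matter what its Pearson value is; a strongly associated table like (100, 1, 1, 100) clears the threshold easily and keeps its Pearson value.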
