Sean Owen suggested that we move our discussion of
http://stackoverflow.com/questions/1738370/best-similarity-metric-for-collaborative-filtering/2796029#comment-6491480
to this mailing list.
Here's the background.
==QUESTION==
I'm trying to decide on the best similarity metric for a product
recommendation system using item-based collaborative filtering. This is a
shopping basket scenario where ratings are binary-valued - the user has
either purchased an item or not - and there is no explicit rating system
(e.g., 5 stars).
Step 1 is to compute item-to-item similarity, though I want to look at
incorporating more features later on.
Is the Tanimoto coefficient the best way to go for binary values? Or are
there other metrics that are appropriate here? Thanks.
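
For reference, on this kind of data the Tanimoto / Jaccard coefficient is
just |A intersect B| / |A union B|, where A and B are the sets of users who
bought each item. A minimal sketch in Java (the names are mine for
illustration, not anything from Mahout):

import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {

    // Tanimoto (Jaccard) coefficient between two items, where each item is
    // represented as the set of user IDs who bought it.
    static double tanimoto(Set<Long> boughtA, Set<Long> boughtB) {
        if (boughtA.isEmpty() && boughtB.isEmpty()) {
            return 0.0; // nobody bought either item: no evidence of similarity
        }
        Set<Long> both = new HashSet<>(boughtA);
        both.retainAll(boughtB);                 // users who bought both items
        int union = boughtA.size() + boughtB.size() - both.size();
        return (double) both.size() / union;
    }

    public static void main(String[] args) {
        Set<Long> itemA = Set.of(1L, 3L, 4L, 7L);
        Set<Long> itemB = Set.of(3L, 4L, 8L);
        // intersection {3, 4}, union {1, 3, 4, 7, 8} -> 2 / 5 = 0.4
        System.out.println(tanimoto(itemA, itemB));
    }
}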
==SEAN OWEN==
Chiming in on this old thread: if you have "binary" ratings (either the
association exists or doesn't, but has no magnitude), then indeed you are
forced to look at metrics like the Tanimoto / Jaccard coefficient.
However, I'd suggest that a log-likelihood similarity metric is significantly
better for situations like this. Here's the code from Mahout:
http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java?view=markup
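
For anyone who doesn't want to open the SVN link: as I read it, the heart
of that class is Ted Dunning's log-likelihood ratio test on the 2x2 table
of co-occurrence counts for a pair of items. A rough sketch of that
computation (my paraphrase, not the actual Mahout source, which does more
bookkeeping around it):

public class LogLikelihoodSketch {

    // x * ln(x), with the convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Shannon entropy of a set of counts, scaled by the total count.
    static double entropy(long... counts) {
        long sum = 0;
        double logSum = 0.0;
        for (long c : counts) {
            logSum += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - logSum;
    }

    // Log-likelihood ratio for the 2x2 contingency table of two items:
    //   k11 = users who bought both items
    //   k12 = users who bought item A but not item B
    //   k21 = users who bought item B but not item A
    //   k22 = users who bought neither
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // 10 users bought both, 30 bought only A, 20 bought only B,
        // 940 bought neither.
        double llr = logLikelihoodRatio(10, 30, 20, 940);
        // One common way to squash the ratio into [0, 1) as a similarity score.
        double similarity = 1.0 - 1.0 / (1.0 + llr);
        System.out.println(llr + " -> " + similarity);
    }
}

The upshot is that a pair of items only scores highly when they co-occur
more often than their individual popularities would predict, which is why it
tends to behave better than raw overlap counts on data like this.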
==COMMENTS==
@Sean Owen Why do you think log likelihood is better than cosine here? –
Lao Tzu Mar 14 at 5:59
Cosine won't work in this situation -- there are no ratings. It will be
undefined for all pairs. – Sean Owen Mar 14 at 9:59
@Sean Owen Take vector A to be a {0,1} list of which users have bought
item A. Take vector B to be a {0,1} list of which users have bought item
B. (A^t times B) over (|A| times |B|) computes the cosine similarity
between the two items. – Lao Tzu Apr 15 at 9:07
Binary ratings are not ratings over {0,1}. That leaves three possible
states: 0, 1, and non-existent. Ratings are always 1 where they exist.
Both vectors end up being (0,0) in the cosine measure, and so the cosine
measure is 0/0 and is undefined. This, at least, is how it would work in
the formulation Mahout uses. – Sean Owen Apr 15 at 11:28
@Sean Owen As I understand @allclaws, it is binary: bought or not bought.
"Non-existent" is the same as not-bought. I'm not familiar with Mahout and
I don't know what you mean by "Both vectors end up being (0,0) in the
cosine measure". dim(A)=dim(B) = number of customers. And dim(
similarity(A,B) ) = 1. So what does the (0,0) pair refer to? – Lao Tzu Apr
15 at 14:02
The data are "centered" in Mahout to make the computation equivalent to
the Pearson correlation. So, the vectors would start out being like
(1,1,1,1) and end up being like (0,0,0,0). The cosine measure between two
degenerate vectors is undefined. Even if you don't center -- the measure
just gives you "1" in all cases since the angle is 0. – Sean Owen Apr 15
at 17:38
@Sean Owen Sorry if I'm being thick. I still don't get where the
constant-value vectors are coming from. Let's say we're talking about item
"Lord of the Rings DVD". User 1 bought it, users 2 thru 79 didn't, user 80
bought it, ... and so on. So vec = [1, 0, 0, 0, ..., 0, 0, 1, 0, 0, ... ].
There's more I don't get in what you said above, but let's start there. –
Lao Tzu Apr 16 at 4:38
We can discuss at [email protected] better perhaps. In my mind at
least, there are never 0 values in the vectors. They are 1, or
non-existent. Otherwise you have three states: 1, 0, or non-existent. –
Sean Owen Apr 16 at 9:45
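
To make that last exchange concrete: in the formulation Sean describes, only
users who have a preference for both items are compared, every preference
that exists has value 1, and the vectors are then mean-centered. A toy
calculation (my own sketch, not Mahout code):

public class CenteringSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // NaN if a norm is 0
    }

    public static void main(String[] args) {
        // Four users expressed a preference for both items; every preference is 1.
        double[] itemA = {1, 1, 1, 1};
        double[] itemB = {1, 1, 1, 1};
        System.out.println(cosine(itemA, itemB)); // 1.0 -- the angle is always 0

        // Mean-centering turns each vector into all zeros, so the measure is 0/0.
        double[] centeredA = {0, 0, 0, 0};
        double[] centeredB = {0, 0, 0, 0};
        System.out.println(cosine(centeredA, centeredB)); // NaN (undefined)
    }
}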
==FIN==
I've never used Mahout, but what @allclaws wants sounds like a simple
proposition. Given a vector like
bought
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
bought
didn't buy
bought
bought
bought
define "bought" == 1 and "didn't buy" == 0. Define distance between two
such vectors to be { A dot B } over { |A| times |B| }. Not that I find
this compelling as a definition of similarity but @allclaws called this a
first, rough pass.
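
Spelled out in code, using the twelve-entry vector above as item A and a
second, made-up item B, that rough pass is just:

public class BinaryCosineSketch {

    // Cosine similarity between two items over the full user population,
    // with coordinate u equal to 1 if user u bought the item and 0 otherwise.
    static double cosine(int[] boughtA, int[] boughtB) {
        int dot = 0;
        int countA = 0;
        int countB = 0;
        for (int u = 0; u < boughtA.length; u++) {
            dot += boughtA[u] * boughtB[u]; // users who bought both
            countA += boughtA[u];           // users who bought A
            countB += boughtB[u];           // users who bought B
        }
        if (countA == 0 || countB == 0) {
            return 0.0; // nobody bought one of the items; call them dissimilar
        }
        // For 0/1 vectors, |A| = sqrt(countA) and |B| = sqrt(countB).
        return dot / Math.sqrt((double) countA * countB);
    }

    public static void main(String[] args) {
        int[] itemA = {1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1};
        int[] itemB = {1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0};
        // 3 users bought both, 5 bought A, 4 bought B -> 3 / sqrt(20) ~= 0.67
        System.out.println(cosine(itemA, itemB));
    }
}

Whether "didn't buy" should really be treated as a 0 coordinate, rather than
as missing data the way Sean and Mahout treat it, is of course the question
the whole thread turns on.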