Sean Owen suggested that we move our discussion of

http://stackoverflow.com/questions/1738370/best-similarity-metric-for-collaborative-filtering/2796029#comment-6491480

to this mailing list.


Here's the background.


==QUESTION==
I'm trying to decide on the best similarity metric for a product recommendation system using item-based collaborative filtering. This is a shopping-basket scenario where ratings are binary-valued: the user has either purchased an item or not; there is no explicit rating system (e.g., 5 stars).

Step 1 is to compute item-to-item similarity, though I want to look at incorporating more features later on.

Is the Tanimoto coefficient the best way to go for binary values? Or are there other metrics that are appropriate here? Thanks
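For reference, the Tanimoto (Jaccard) coefficient over binary purchase data is just |A ∩ B| / |A ∪ B|, where A and B are the sets of users who bought each item. A minimal sketch in Python (the item data below is made up):

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto/Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(buyers_a), set(buyers_b)
    union = a | b
    if not union:
        return 0.0  # convention when neither item has any buyers
    return len(a & b) / len(union)

# Hypothetical data: user IDs who bought each item.
item_a = [1, 5, 7, 9]
item_b = [1, 5, 8]
print(tanimoto(item_a, item_b))  # 2 shared buyers / 5 distinct buyers = 0.4
```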





==SEAN OWEN==
Chiming in on this old thread: if you have "binary" ratings (either the association exists or doesn't, but has no magnitude), then indeed you are forced to look at metrics like the Tanimoto / Jaccard coefficient.

However, I'd suggest a log-likelihood similarity metric is significantly better for situations like this. Here's the code from Mahout: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java?view=markup
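For readers without the Mahout source handy: the linked class scores an item pair with Ted Dunning's log-likelihood ratio over a 2x2 contingency table of co-occurrence counts. A rough Python port of that idea (a sketch, not the Mahout code itself; the similarity Mahout then derives from the ratio is 1 - 1/(1 + LLR)):

```python
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy as used in the G-test: xlogx(sum) - sum(xlogx).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    # 2x2 contingency table for items A and B:
    #   k11 = users who bought both      k12 = bought A only
    #   k21 = bought B only              k22 = bought neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Hypothetical counts: perfect association vs. no association.
print(log_likelihood_ratio(10, 0, 0, 10))    # large: strong association
print(log_likelihood_ratio(10, 10, 10, 10))  # ~0: overlap matches chance
```

An LLR near 0 means the two items co-occur about as often as independence predicts; a large LLR means the overlap is unlikely to be chance.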


==COMMENTS==
@Sean Owen Why do you think log likelihood is better than cosine here? – Lao Tzu Mar 14 at 5:59


Cosine won't work in this situation -- there are no ratings. It will be undefined for all pairs. – Sean Owen Mar 14 at 9:59

@Sean Owen Take vector A to be a {0,1} list of which users have bought item A. Take vector B to be a {0,1} list of which users have bought item B. (A^t times B) over (|A| times |B|) computes the cosine similarity between the two items. – Lao Tzu Apr 15 at 9:07


Binary ratings are not ratings over {0,1}. That leaves three possible states: 0, 1, and non-existent. Ratings are always 1 where they exist. Both vectors end up being (0,0) in the cosine measure, and so the cosine measure is 0/0 and is undefined. This, at least, is how it would work in the formulation Mahout uses. – Sean Owen Apr 15 at 11:28

@Sean Owen As I understand @allclaws it is binary. Bought or not bought. "Non-existent" is the same as not-bought. I'm not familiar with Mahout and I don't know what you mean by "Both vectors end up being (0,0) in the cosine measure". dim(A) = dim(B) = number of customers, and dim(similarity(A,B)) = 1. So what does the (0,0) pair refer to? – Lao Tzu Apr 15 at 14:02


The data are "centered" in Mahout to make the computation equivalent to the Pearson correlation. So, the vectors would start out being like (1,1,1,1) and end up being like (0,0,0,0). The cosine measure between two degenerate vectors is undefined. Even if you don't center -- the measure just gives you "1" in all cases since the angle is 0. – Sean Owen Apr 15 at 17:38

@Sean Owen Sorry if I'm being thick. I still don't get where the constant-value vectors are coming from. Let's say we're talking about item "Lord of the Rings DVD". User 1 bought it, users 2 thru 79 didn't, user 80 bought it, ... and so on. So vec = [1, 0, 0, 0, ..., 0, 0, 1, 0, 0, ... ]. There's more I don't get in what you said above but let's start there. – Lao Tzu Apr 16 at 4:38


We can discuss at [email protected] better perhaps. In my mind at least, there are never 0 values in the vectors. They are 1, or non-existent. Otherwise you have three states: 1, 0, or non-existent. – Sean Owen Apr 16 at 9:45
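To make the disagreement concrete: in the model Sean describes, an item's rating vector contains only the ratings that exist, and for purchase data those are all 1. A few lines of Python show what mean-centering (the Pearson-style step he mentions) does to such a vector; this is an illustration of the point, not Mahout code:

```python
# Mahout-style: the vector holds only existing ratings, and they are all 1.0.
ratings = [1.0, 1.0, 1.0]
mean = sum(ratings) / len(ratings)
centered = [r - mean for r in ratings]
norm = sum(x * x for x in centered) ** 0.5
print(centered, norm)  # [0.0, 0.0, 0.0] 0.0 -> cosine = 0/0, undefined
```

In Lao Tzu's formulation, by contrast, every user contributes an explicit 0 or 1, so the vectors are generally not degenerate and the cosine is well-defined.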




==FIN==


I've never used Mahout, but what @allclaws wants sounds like a simple proposition. Given a vector like

bought
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
bought
didn't buy
bought
bought
bought



define "bought" == 1 and "didn't buy" == 0. Define the similarity between two such vectors to be { A dot B } over { |A| times |B| }. Not that I find this compelling as a definition of similarity, but @allclaws called this a first, rough pass.
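That proposition is only a few lines of code. A sketch, using the 12-user bought/didn't-buy list above for item A and a made-up vector for a second item B:

```python
def cosine_similarity(a, b):
    """A dot B over |A| times |B|, for {0,1} purchase vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an item nobody bought
    return dot / (norm_a * norm_b)

# Item A: the bought/didn't-buy list above, with bought == 1.
item_a = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
# Item B: hypothetical purchases of a second item.
item_b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(cosine_similarity(item_a, item_b))  # 3 / (sqrt(5) * 2), about 0.67
```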
