Sean Owen suggested that we move our discussion of
http://stackoverflow.com/questions/1738370/best-similarity-metric-for-collaborative-filtering/2796029#comment-6491480
to this mailing list.
Here's the background.
==QUESTION==
I'm trying to decide on the best similarity metric for a product
recommendation system using item-based collaborative filtering. This is a
shopping basket scenario where ratings are binary-valued - the user has
either purchased an item or not - and there is no explicit rating system
(e.g., 5 stars).
Step 1 is to compute item-to-item similarity, though I want to look at
incorporating more features later on.
Is the Tanimoto coefficient the best way to go for binary values? Or are
there other metrics that are appropriate here? Thanks.
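
For reference, on this kind of data the Tanimoto / Jaccard coefficient is
just |A intersect B| / |A union B|, where A and B are the sets of users who
bought each item. A minimal sketch in Java (the names are mine for
illustration, not anything from Mahout):

import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {

    // Tanimoto (Jaccard) coefficient between two items, where each item is
    // represented as the set of user IDs who bought it.
    static double tanimoto(Set<Long> boughtA, Set<Long> boughtB) {
        if (boughtA.isEmpty() && boughtB.isEmpty()) {
            return 0.0; // nobody bought either item: no evidence of similarity
        }
        Set<Long> both = new HashSet<>(boughtA);
        both.retainAll(boughtB);                 // users who bought both items
        int union = boughtA.size() + boughtB.size() - both.size();
        return (double) both.size() / union;
    }

    public static void main(String[] args) {
        Set<Long> itemA = Set.of(1L, 3L, 4L, 7L);
        Set<Long> itemB = Set.of(3L, 4L, 8L);
        // intersection {3, 4}, union {1, 3, 4, 7, 8} -> 2 / 5 = 0.4
        System.out.println(tanimoto(itemA, itemB));
    }
}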
==SEAN OWEN==
Chiming in on this old thread: if you have "binary" ratings (either the
association exists or doesn't, but has no magnitude), then indeed you are
forced to look at metrics like the Tanimoto / Jaccard coefficient.
However, I'd suggest that a log-likelihood similarity metric is significantly
better for situations like this. Here's the code from Mahout:
http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java?view=markup
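
For anyone who doesn't want to open the SVN link: as I read it, the heart
of that class is Ted Dunning's log-likelihood ratio test on the 2x2 table
of co-occurrence counts for a pair of items. A rough sketch of that
computation (my paraphrase, not the actual Mahout source, which does more
bookkeeping around it):

public class LogLikelihoodSketch {

    // x * ln(x), with the convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Shannon entropy of a set of counts, scaled by the total count.
    static double entropy(long... counts) {
        long sum = 0;
        double logSum = 0.0;
        for (long c : counts) {
            logSum += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - logSum;
    }

    // Log-likelihood ratio for the 2x2 contingency table of two items:
    //   k11 = users who bought both items
    //   k12 = users who bought item A but not item B
    //   k21 = users who bought item B but not item A
    //   k22 = users who bought neither
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // 10 users bought both, 30 bought only A, 20 bought only B,
        // 940 bought neither.
        double llr = logLikelihoodRatio(10, 30, 20, 940);
        // One common way to squash the ratio into [0, 1) as a similarity score.
        double similarity = 1.0 - 1.0 / (1.0 + llr);
        System.out.println(llr + " -> " + similarity);
    }
}

The upshot is that a pair of items only scores highly when they co-occur
more often than their individual popularities would predict, which is why it
tends to behave better than raw overlap counts on data like this.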
==COMMENTS==
@Sean Owen Why do you think log likelihood is better than cosine here? –
Lao Tzu Mar 14 at 5:59
Cosine won't work in this situation -- there are no ratings. It will be
undefined for all pairs. – Sean Owen Mar 14 at 9:59
@Sean Owen Take vector A to be a {0,1} list of which users have bought
item A. Take vector B to be a {0,1} list of which users have bought item
B. (A^t times B) over (|A| times |B|) computes the cosine similarity
between the two items. – Lao Tzu Apr 15 at 9:07
Binary ratings are not ratings over {0,1}. That leaves three possible
states: 0, 1, and non-existent. Ratings are always 1 where they exist.
Both vectors end up being (0,0) in the cosine measure, and so the cosine
measure is 0/0 and is undefined. This, at least, is how it would work in
the formulation Mahout uses. – Sean Owen Apr 15 at 11:28
@Sean Owen As I understand @allclaws, it is binary: bought or not bought.
"Non-existent" is the same as not-bought. I'm not familiar with Mahout and
I don't know what you mean by "Both vectors end up being (0,0) in the
cosine measure". dim(A)=dim(B) = number of customers. And dim(
similarity(A,B) ) = 1. So what does the (0,0) pair refer to? – Lao Tzu Apr
15 at 14:02
The data are "centered" in Mahout to make the computation equivalent to
the Pearson correlation. So, the vectors would start out being like
(1,1,1,1) and end up being like (0,0,0,0). The cosine measure between two
degenerate vectors is undefined. Even if you don't center -- the measure
just gives you "1" in all cases since the angle is 0. – Sean Owen Apr 15
at 17:38
@Sean Owen Sorry if I'm being thick. I still don't get where the
constant-value vectors are coming from. Let's say we're talking about item
"Lord of the Rings DVD". User 1 bought it, users 2 thru 79 didn't, user 80
bought it, ... and so on. So vec = [1, 0, 0, 0, ..., 0, 0, 1, 0, 0, ... ].
There's more I don't get in what you said above, but let's start there. –
Lao Tzu Apr 16 at 4:38
We can discuss at [email protected] better perhaps. In my mind at
least, there are never 0 values in the vectors. They are 1, or
non-existent. Otherwise you have three states: 1, 0, or non-existent. –
Sean Owen Apr 16 at 9:45
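
To make that last exchange concrete: in the formulation Sean describes, only
users who have a preference for both items are compared, every preference
that exists has value 1, and the vectors are then mean-centered. A toy
calculation (my own sketch, not Mahout code):

public class CenteringSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // NaN if a norm is 0
    }

    public static void main(String[] args) {
        // Four users expressed a preference for both items; every preference is 1.
        double[] itemA = {1, 1, 1, 1};
        double[] itemB = {1, 1, 1, 1};
        System.out.println(cosine(itemA, itemB)); // 1.0 -- the angle is always 0

        // Mean-centering turns each vector into all zeros, so the measure is 0/0.
        double[] centeredA = {0, 0, 0, 0};
        double[] centeredB = {0, 0, 0, 0};
        System.out.println(cosine(centeredA, centeredB)); // NaN (undefined)
    }
}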
==FIN==
I've never used Mahout, but what @allclaws wants sounds like a simple
proposition. Given a vector like
bought
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
bought
didn't buy
bought
bought
bought
define "bought" == 1 and "didn't buy" == 0. Define distance between two
such vectors to be { A dot B } over { |A| times |B| }. Not that I find
this compelling as a definition of similarity but @allclaws called this a
first, rough pass.
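
Spelled out in code, using the twelve-entry vector above as item A and a
second, made-up item B, that rough pass is just:

public class BinaryCosineSketch {

    // Cosine similarity between two items over the full user population,
    // with coordinate u equal to 1 if user u bought the item and 0 otherwise.
    static double cosine(int[] boughtA, int[] boughtB) {
        int dot = 0;
        int countA = 0;
        int countB = 0;
        for (int u = 0; u < boughtA.length; u++) {
            dot += boughtA[u] * boughtB[u]; // users who bought both
            countA += boughtA[u];           // users who bought A
            countB += boughtB[u];           // users who bought B
        }
        if (countA == 0 || countB == 0) {
            return 0.0; // nobody bought one of the items; call them dissimilar
        }
        // For 0/1 vectors, |A| = sqrt(countA) and |B| = sqrt(countB).
        return dot / Math.sqrt((double) countA * countB);
    }

    public static void main(String[] args) {
        int[] itemA = {1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1};
        int[] itemB = {1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0};
        // 3 users bought both, 5 bought A, 4 bought B -> 3 / sqrt(20) ~= 0.67
        System.out.println(cosine(itemA, itemB));
    }
}

Whether "didn't buy" should really be treated as a 0 coordinate, rather than
as missing data the way Sean and Mahout treat it, is of course the question
the whole thread turns on.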