On Thu, Feb 17, 2011 at 5:58 AM, Chris Schilling <[email protected]> wrote:
> First.  It is apparent that when dealing with sparse data, which most CF 
> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not 
> extremely useful.  They do seem to be very useful, however, when dealing with 
> dense vectors/matrices.

I would say that in any data set you have some pockets of density
and a long tail of sparsity -- items that co-occur only once, for
example. And these metrics don't take much account of chance
coincidence or anomalies. So, your results are more frequently
skewed by anomalous pairs.

This is an example of how ignoring some data can help when it's noise,
and a surprising amount of data is noise. Ignoring the rating values
entirely helped here. That's not always the case, but it's true more
often than you'd imagine.

(But see below about how to make Pearson more usable in practice.)


> One question I have regarding the cosine similarity: it seems this is 
> calculated with respect to the intersection of the two vectors.  What would 
> happen if we actually divided the dot product by the total magnitudes (i.e. 
> not just the magnitude of the intersection)?  Wouldn't that place more weight 
> on the vectors which have more ratings in common?

This would have the effect of assuming that the "missing" values
(where one user has a rating and the other doesn't) are 0. If "0" in
your rating system happens to correspond to a very neutral value,
that's pretty valid. If you're on a 1-5 rating system, it probably
isn't. It would be like assuming that anything you haven't watched is
utterly hated.
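To make the difference concrete, here's a small sketch comparing the two definitions. The data is made up for illustration; dividing by the full vector norms is exactly equivalent to filling every missing rating with 0:

```python
import math

# Toy 1-5 ratings for two users, keyed by item ID (illustrative only)
a = {"item1": 4.0, "item2": 5.0, "item3": 1.0}
b = {"item2": 5.0, "item3": 2.0, "item4": 3.0}

def cosine_over_intersection(a, b):
    """Cosine similarity computed only over co-rated items."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb) if na and nb else 0.0

def cosine_full_magnitude(a, b):
    """Dot product over the full vector norms -- equivalent to
    treating every missing rating as 0."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v ** 2 for v in a.values()))
    nb = math.sqrt(sum(v ** 2 for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The second version is always pulled downward, and by more when the overlap is small -- which is the "everything unrated is hated" effect on a 1-5 scale.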

You could fill in the user's average rating for missing values. This
is what PreferenceInferrer does for you in the API if you like. It'll
slow things down but you can try it to combat this effect.
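The inference idea looks something like this sketch -- fill each user's gaps with that user's own mean rating before comparing. This is just the idea, not Mahout's actual PreferenceInferrer implementation, and the data is made up:

```python
import math

# Toy 1-5 ratings for two users, keyed by item ID (illustrative only)
a = {"item1": 4.0, "item2": 5.0, "item3": 1.0}
b = {"item2": 5.0, "item3": 2.0, "item4": 3.0}

def cosine_with_inferred_means(a, b):
    """Cosine over the union of rated items, with each user's
    missing values filled in by that user's own mean rating.
    A sketch of mean-based preference inference, not Mahout's code."""
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    items = sorted(set(a) | set(b))
    va = [a.get(i, mean_a) for i in items]
    vb = [b.get(i, mean_b) for i in items]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb)
```

Filling with the mean is far more neutral than filling with 0, at the cost of computing over the union of items instead of just the intersection.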

There's already an option in the code to weight the result by the
number of data points (items in common) -- use Weighted.WEIGHTED. It
simply pushes the result closer to 1 or -1 in a reasonable way.
There's nothing magical, or particularly valid or invalid, about the
math there, but it does what you want without the unwanted assumptions.
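One way to get that "push toward 1 or -1" behavior is simple interpolation toward the nearest extreme as the overlap grows. To be clear, this is just a sketch of the idea; Mahout's actual WEIGHTED formula may differ:

```python
def weight_by_count(sim, n, total):
    """Push a similarity toward +1 or -1 as the number of co-rated
    items n approaches the total possible overlap. A sketch of the
    idea behind count-weighting, not Mahout's exact formula."""
    if total <= 0 or n >= total:
        return sim
    strength = n / total            # fraction of possible overlap seen
    extreme = 1.0 if sim >= 0 else -1.0
    # interpolate between the raw similarity and the nearest extreme
    return sim + strength * (extreme - sim)
```

So a Pearson score of 0.5 backed by 8 of 10 possible common items ends up near 0.9, while the same 0.5 backed by 1 item barely moves.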


> Second: I agree that the likelihood approach (i.e. boolean preferences) helps 
> a lot with sparse data.  So, my question is given a simple log-likelihood 
> log(r/m+n) where r is the number of prefs in common and m+n is the total 
> number of prefs in the two vectors, and the Pearson correlation of the 
> intersection, wouldn't the product of these two approximate the true cosine 
> similarity taking into account the ratings?

(That's not quite log-likelihood -- it looks more like the Tanimoto
coefficient, r / (m + n - r). Log-likelihood is something a bit more
subtle.)
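For reference, the Tanimoto (Jaccard) coefficient over two users' sets of preferred item IDs is just:

```python
def tanimoto(prefs_a, prefs_b):
    """Tanimoto coefficient: r / (m + n - r), where r is the size of
    the intersection and m, n are the sizes of the two sets."""
    r = len(prefs_a & prefs_b)
    return r / (len(prefs_a) + len(prefs_b) - r)

# e.g. 2 items in common out of 4 distinct items overall:
tanimoto({"a", "b", "c"}, {"b", "c", "d"})  # 2 / (3 + 3 - 2) = 0.5
```

Note the denominator counts each common item once, which is why it's m + n - r rather than m + n.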

I'd hesitate to call what you have in mind the "true" cosine
similarity for the reason above. It's really the result of inferring 0
for missing data, which is less true to the data.

The product of these two similarities may be useful, but I don't know
that it has any particular mathematical interpretation.

>
> The main problem with the likelihood is that it does not take into account 
> one user disliking and another user liking the same item.  This seems to be 
> more important in dealing with very sparse data.  However, I do understand 
> the motivation, especially given that users more generally rate what they 
> like and less what they dislike.

One common approach is to segment the data into, effectively, two
boolean models: things I hate and things I like. Recommend from both,
and use both outputs to estimate how relatively likely it is that you
love versus hate an item.
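The first step of that segmentation can be as simple as thresholding the ratings into two boolean preference sets. The threshold here is an arbitrary choice for illustration; you'd pick one that fits your rating scale:

```python
def split_models(ratings, like_threshold=4.0):
    """Split one user's 1-5 ratings into two boolean preference sets:
    likes and dislikes. The 4.0 cutoff is an arbitrary illustrative
    choice, not a recommended value."""
    likes = {i for i, r in ratings.items() if r >= like_threshold}
    dislikes = {i for i, r in ratings.items() if r < like_threshold}
    return likes, dislikes
```

You'd then build a boolean recommender over each set separately and compare the two scores an item receives to judge love versus hate.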
