Another option here is to use a biased ratio.  This is more common in
computing popularity.  If you take the average popularity rate to be
k_0 / n_0, then you can estimate the popularity of a thing that has been
liked/viewed/rated k times out of n opportunities as (k + k_0) / (n + n_0).
Pick n_0 to set how skeptical the system is, that is, how much data it takes
to overcome its preconceived estimate of popularity.  Picking n_0 fixes k_0
because the ratio k_0 / n_0 has to match the average rate.

This trick has surprisingly deep mathematical roots and works pretty darned
well.
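As a quick sketch of the idea (the pseudo-counts k_0 = 5 and n_0 = 100,
i.e. an assumed average rate of 5%, are made-up numbers for illustration):

```python
def smoothed_popularity(k, n, k_0=5, n_0=100):
    """Estimate the popularity of an item liked k times in n opportunities.

    The raw ratio k/n is pulled toward the prior rate k_0/n_0 (here 5%);
    n_0 controls how much data it takes to overcome that prior.
    """
    return (k + k_0) / (n + n_0)

# A brand-new item with no data falls back to the average rate:
print(smoothed_popularity(0, 0))         # -> 0.05
# An item liked 50 times out of 100 is pulled part way toward the prior:
print(smoothed_popularity(50, 100))      # -> (50+5)/(100+100) = 0.275
# With lots of data, the raw ratio dominates:
print(smoothed_popularity(5000, 10000))  # -> 5005/10100, about 0.4955
```

Note how the estimate starts at the prior and converges to k/n as n grows
past n_0.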

On Wed, Feb 16, 2011 at 10:59 PM, Sean Owen <[email protected]> wrote:

> > Second: I agree that the likelihood approach (i.e. boolean preferences)
> helps a lot with sparse data.  So, my question is given a simple
> log-likelihood log(r/(m+n)) where r is the number of prefs in common and m+n
> is the total number of prefs in the two vectors, and the Pearson correlation
> of the intersection, wouldn't the product of these two approximate the true
> cosine similarity taking into account the ratings?
>
> (That's not quite log-likelihood -- looks more like the Tanimoto
> coefficient, r/(m+n-r). LL is something a bit more subtle.)
>
> I'd hesitate to call what you have in mind the "true" cosine
> similarity for the reason above. It's really the result of inferring 0
> for missing data, which is less true to the data.
