Another option here is to use a biased ratio. This is more common in computing popularity. If you take the average popularity as k_0 / n_0, then you can estimate the popularity of a thing that has been liked/viewed/rated k times out of n opportunities as (k + k_0) / (n + n_0). Pick n_0 to set the degree of skepticism the system has, i.e. how much data it takes to overcome its preconceived estimate of popularity. Picking n_0 fixes k_0, because the ratio k_0 / n_0 has to match the average rate.
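A minimal sketch of that biased ratio, with n_0 as the prior strength (the value 100.0 here is just an illustrative choice, not a recommendation):

```python
def smoothed_popularity(k, n, avg_rate, n0=100.0):
    """Biased-ratio popularity estimate (k + k0) / (n + n0).

    k        -- times the item was liked/viewed/rated
    n        -- opportunities the item had
    avg_rate -- overall average rate, k0 / n0
    n0       -- prior strength: how much data it takes to
                overcome the preconceived estimate (tunable)
    """
    # k0 is fixed once n0 is chosen, since k0 / n0 must match avg_rate
    k0 = avg_rate * n0
    return (k + k0) / (n + n0)

# An item with no data starts at the average rate:
#   smoothed_popularity(0, 0, 0.05)            -> 0.05
# An item with lots of data converges to its own rate:
#   smoothed_popularity(5000, 10000, 0.05)     -> ~0.5
```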
This trick has surprisingly deep mathematical roots and works pretty darned well.

On Wed, Feb 16, 2011 at 10:59 PM, Sean Owen <[email protected]> wrote:
>
> Second: I agree that the likelihood approach (i.e. boolean preferences)
> helps a lot with sparse data. So, my question is given a simple
> log-likelihood log(r/m+n) where r is the number of prefs in common and m+n
> is the total number of prefs in the two vectors, and the Pearson correlation
> of the intersection, wouldn't the product of these two approximate the true
> cosine similarity taking into account the ratings?
>
> (That's not quite log-likelihood -- looks more like the Tanimoto
> coefficient of r/m+n-r. LL is something a bit more subtle.)
>
> I'd hesitate to call what you have in mind the "true" cosine
> similarity for the reason above. It's really the result of inferring 0
> for missing data, which is less true to the data.
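For reference, the Tanimoto coefficient mentioned in the quote -- r / (m + n - r), where r is the number of prefs in common and m, n are the two users' pref counts -- is just intersection over union of the boolean preference sets. A minimal illustration (not Mahout's implementation):

```python
def tanimoto(prefs_a, prefs_b):
    """Tanimoto coefficient of two boolean preference sets.

    With r = |a & b| prefs in common, m = |a|, n = |b|,
    this is r / (m + n - r), i.e. intersection over union.
    """
    r = len(prefs_a & prefs_b)
    return r / (len(prefs_a) + len(prefs_b) - r)

# Two users' liked-item ids (hypothetical data):
#   tanimoto({1, 2, 3}, {2, 3, 4})  ->  2 / (3 + 3 - 2) = 0.5
```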
