Actually, this is a great example of why LLR works. Intuitively speaking, what do you know about the taste of someone who prefers 90% of items? Nothing; they watch them all. What value are the co-occurrences in their movie watches? None. In fact, I'd look to see whether pref1 isn't an anomaly caused by a crawler or something. Intuitively speaking, LLR found the non-differentiating user and properly ignored them.
On Sep 10, 2014, at 8:43 PM, Ted Dunning <[email protected]> wrote:

It might help to look at the matrices that result. First I defined a function in R to generate the contingency tables:

> f = function(k11, n1=100, n2=900000, n=1e6) {matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)}

One of your examples is this one:

> f(90)
     [,1]   [,2]
[1,]   90 899910
[2,]   10  99990

Notice how the two columns are basically the same except for a scaling factor. Here is your other example:

> f(10)
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

Now what we have is that in the first column, row 2 is bigger, while in the second column, row 1 is bigger. That is, the distributions are quite different. Here is the actual LLR score for the first example:

> llr(f(90))
[1] -1.275022e-10

(The negative sign is spurious, the result of round-off error. The real result is basically just 0.)

And for the second:

> llr(f(10))
[1] 351.6271

Here we see a huge value, which says that (as we saw) the distributions are different. For reference, here is the R code for llr:

> llr
function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
> H
function(k) {N = sum(k); return (sum(k/N * log(k/N + (k==0))))}

On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <[email protected]> wrote:

> Hi Dmitriy,
>
> I am following the same calculation used in the userSimilarity method in
> LogLikelihoodSimilarity.java:
>
> k11 = intersectionSize (both users viewed the movie)
> k12 = prefs2Size - intersectionSize (viewed only by user 2)
> k21 = prefs1Size - intersectionSize (viewed only by user 1)
> k22 = numItems - prefs1Size - prefs2Size + intersectionSize (viewed by
> neither user)
>
> Thanks,
> Aishwarya
>
> On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>> How do you compute the k11, k12... values exactly?
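Ted's R session can be reproduced outside R as well. Here is a sketch in Python (my transcription of his `f`, `H`, and `llr` definitions using numpy; the function names follow his R code, everything else is an assumption for illustration):

```python
import numpy as np

def f(k11, n1=100, n2=900000, n=1_000_000):
    """Build the 2x2 contingency table, laid out as in the R matrix()
    call (column-major: column 1 is (k11, n1-k11))."""
    return np.array([[k11,      n2 - k11],
                     [n1 - k11, n - n1 - n2 + k11]], dtype=float)

def H(k):
    # sum of p*log(p), with 0*log(0) treated as 0
    # (the same trick as the R expression k/N + (k==0))
    N = k.sum()
    p = k / N
    return np.sum(p * np.log(p + (p == 0)))

def llr(k):
    # 2 * N * (H(k) - H(row sums) - H(column sums))
    return 2 * k.sum() * (H(k) - H(k.sum(axis=1)) - H(k.sum(axis=0)))

print(llr(f(90)))  # essentially 0, up to round-off: the columns are proportional
print(llr(f(10)))  # ~351.63: the two columns have very different distributions
```

The two printed scores should match the R outputs above (0 and 351.6271) to floating-point precision.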
>>
>> On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have the following case where numItems = 1,000,000, prefs1Size =
>>> 900,000, and prefs2Size = 100.
>>>
>>> It is the case when I have two users, one who has seen 90% of the movies
>>> in the database and another only 100 of the million movies. Suppose they
>>> have 90 movies in common (user 2 has seen only 100 movies in total); I
>>> would assume the similarity to be high compared to when they have only
>>> 10 movies in common. But the similarities I am getting are
>>> 0.9971 for intersection size 10 and
>>> 0 for intersection size 90.
>>>
>>> This seems counterintuitive.
>>>
>>> Am I missing something? Is there an explanation for the above-mentioned
>>> values?
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/LogLikelihoodSimilarity-calculation-tp4158035.html
>>> Sent from the Mahout User List mailing list archive at Nabble.com.
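For anyone following along: the k11..k22 construction quoted above plugs straight into a table like Ted's f. A sketch in Python with the numbers from the original question, intersection size 90 (the variable names are mine, not Mahout's):

```python
num_items   = 1_000_000
prefs1_size = 900_000   # user 1 has seen 90% of the catalog
prefs2_size = 100       # user 2 has seen 100 movies
intersection = 90       # movies both users have seen

k11 = intersection                                          # seen by both
k12 = prefs2_size - intersection                            # only user 2
k21 = prefs1_size - intersection                            # only user 1
k22 = num_items - prefs1_size - prefs2_size + intersection  # seen by neither

# The four cells partition the catalog: every item is counted exactly once.
assert k11 + k12 + k21 + k22 == num_items

# Both columns split 9:1 between user 1's seen and unseen items -- the
# exact independence pattern that makes LLR report ~0 for this case.
print(k11, k21)  # 90 899910
print(k12, k22)  # 10 99990
```

That 9:1 proportionality in both columns is why the intersection-90 case scores 0: knowing what user 2 watched tells you nothing beyond user 1's base rate.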
