It might help to look at the matrices that result:
First I defined a function in R to generate the contingency tables:
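> f = function(k11, n1=100, n2=900000, n=1e6) {
+   # rows: heavy user (n2 items) viewed / not viewed
+   # columns: light user (n1 items) viewed / not viewed
+   matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)
+ }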
One of your examples is this one:
> f(90)
     [,1]   [,2]
[1,]   90 899910
[2,]   10  99990
Notice how the two columns are the same except for a scaling factor: both
split 9:1 between row 1 and row 2.
Here is your other example:
> f(10)
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910
Now what we have is that in the first column, row 2 is bigger while in the
second column, row 1 is bigger. That is, the distributions are quite
different.
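To make that comparison concrete, each column can be rescaled so that it sums to 1. This is only a small sketch using base R's prop.table; f is the helper defined earlier, repeated here so the snippet runs on its own, and the names p90 and p10 are just for illustration.

```r
# Helper from above, repeated so this snippet is self-contained.
f = function(k11, n1=100, n2=900000, n=1e6) {
  matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)
}

# margin=2 divides each column by its own sum
p90 = prop.table(f(90), margin=2)
p10 = prop.table(f(10), margin=2)

print(p90)   # both columns share the same 9:1 split
print(p10)   # column 1 is 1:9 while column 2 stays close to 9:1
```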
Here is the actual LLR score for the first example:
> llr(f(90))
[1] -1.275022e-10
(The negative sign is spurious and the result of round-off error. The real
result is basically just 0.)
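In fact the true value here is exactly 0, not merely small. One way to see it is to write the score in its usual G-squared form, which is algebraically the same quantity as the entropy-based R code given further down. The symbols r_i, c_j and N are my notation for the row sums, column sums and grand total of the table.

```latex
\mathrm{LLR} = 2 \sum_{i,j} k_{ij} \,\ln \frac{k_{ij}\, N}{r_i\, c_j},
\qquad r_i = \sum_j k_{ij}, \quad c_j = \sum_i k_{ij}, \quad N = \sum_{i,j} k_{ij}

% For f(90): r = (900000, 100000), c = (100, 999900), N = 10^6
\frac{r_1 c_1}{N} = 90, \quad
\frac{r_1 c_2}{N} = 899910, \quad
\frac{r_2 c_1}{N} = 10, \quad
\frac{r_2 c_2}{N} = 99990
```

Each expected count equals the observed count in that cell, so every logarithm is ln 1 = 0 and the sum vanishes.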
And for the second:
> llr(f(10))
[1] 351.6271
Here we see a huge value, which says that (as we saw) the distributions are
different.
For reference, here is the R code for llr:
> llr
function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
> H
function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}
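As a rough sketch of how these scores relate to the similarities quoted below: if the similarity is computed as 1 - 1/(1 + LLR), the scores above reproduce your numbers. That mapping is my assumption about what LogLikelihoodSimilarity.java does, not something shown in this thread. The helper name sim and the clamp at zero are mine, for illustration only.

```r
# Definitions repeated from above so the snippet runs on its own.
f = function(k11, n1=100, n2=900000, n=1e6) {
  matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)
}
H = function(k) {N = sum(k); sum(k/N * log(k/N + (k==0)))}
llr = function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}

# Hypothetical helper: maps an LLR score into [0, 1).
# Assumes similarity = 1 - 1/(1 + LLR); pmax clamps round-off negatives to 0.
sim = function(score) {1 - 1 / (1 + pmax(score, 0))}

s10 = sim(llr(f(10)))   # about 0.9972, in line with the 0.9971 you reported
s90 = sim(llr(f(90)))   # 0, up to round-off
print(c(s10, s90))
```

So a similarity of 0 for an intersection of 90 is what the test should report: with these marginals, 90 shared movies is exactly what independence predicts.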
On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <
[email protected]> wrote:
> Hi Dmitriy,
>
> I am following the same calculation used in the userSimilarity method in
> LogLikelihoodSimilarity.java
>
> k11 = intersectionSize (both users viewed movie)
>
> k12 = prefs2Size - intersectionSize (only viewed by user 2)
>
> k21 = prefs1Size - intersectionSize (only viewed by user 1)
>
> k22 = numItems - prefs1Size - prefs2Size + intersectionSize (not viewed by
> both 1 and 2)
>
>
> Thanks,
>
> Aishwarya
>
> On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > how do you compute k11, k12... values exactly?
> >
> > On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have the following case where numItems = 1,000,000, prefs1Size = 900,000
> > > and prefs2Size = 100.
> > >
> > > It is the case when i have two users, one who has seen 90% of the movies in
> > > the database and another only 100 of the million movies. Suppose they have
> > > 90 movies in common (user 2 has seen only 100 movies totally), i would
> > > assume the similarity to be high compared to when they have only 10 movies
> > > in common. But the similarities i am getting are
> > > 0.9971 for intersection size 10 and
> > > 0 for intersection size 90.
> > >
> > > This seems counter intuitive.
> > >
> > > Am i missing something? Is there an explanation for the above mentioned
> > > values?
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/LogLikelihoodSimilarity-calculation-tp4158035.html
> > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > >
> >
>