Re: LogLikelihoodSimilarity calculation

Ted Dunning Wed, 10 Sep 2014 20:44:14 -0700

It might help to look at the matrices that result:

First I defined a function in R to generate the contingency tables:


> f = function(k11, n1=100, n2=900000, n=1e6) {
+   matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)
+ }

One of your examples (intersection size 90) is this one:

> f(90)
     [,1]   [,2]
[1,]   90 899910
[2,]   10  99990

Notice how the two columns are basically the same except for a scaling
factor: each one splits about 9 to 1 between row 1 and row 2.

Here is your other example (intersection size 10):

> f(10)
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

Now what we have is that in the first column, row 2 is bigger while in the
second column, row 1 is bigger.  That is, the distributions are quite
different.
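
To make the column comparison concrete, here is a small sketch (not part of
the original exchange) that normalizes each column to proportions and
computes the overlap you would expect by chance. The names expected_k11, p90
and p10 are made up for illustration. The idea: one user has seen 90% of all
items, so about 90 of the other user's 100 items should overlap even if the
two users are unrelated.

```r
# Contingency table as defined earlier in this message (filled column by column).
f <- function(k11, n1 = 100, n2 = 900000, n = 1e6) {
  matrix(c(k11, n1 - k11, n2 - k11, n - n1 - n2 + k11), nrow = 2)
}

# Overlap expected by chance if the two users were independent: n1 * n2 / n.
expected_k11 <- 100 * 900000 / 1e6

# Normalize each column so it sums to 1 (margin = 2 means "by column").
p90 <- prop.table(f(90), margin = 2)
p10 <- prop.table(f(10), margin = 2)

print(expected_k11)
print(p90)
print(p10)
```

For f(90) both columns come out as 0.9 and 0.1, so an overlap of 90 is just
what chance predicts. For f(10) the first column flips to 0.1 and 0.9 while
the second stays close to 0.9 and 0.1.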

Here is the actual LLR score for the first example:

> llr(f(90))
[1] -1.275022e-10

(The negative sign is spurious and the result of round-off error.  The real
result is basically just 0.)

And for the second:

> llr(f(10))
[1] 351.6271

Here we see a huge value, which says that (as we saw) the distributions are
different.
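
As a cross-check, the same scores can be computed from the textbook
observed-versus-expected form of the statistic, 2 * sum(O * log(O / E)),
where E is the table you would get if rows and columns were independent.
This is only a sketch; g_stat is a hypothetical helper name, not something
from Mahout or from this thread.

```r
# Contingency table as defined earlier in this message (filled column by column).
f <- function(k11, n1 = 100, n2 = 900000, n = 1e6) {
  matrix(c(k11, n1 - k11, n2 - k11, n - n1 - n2 + k11), nrow = 2)
}

# Hypothetical helper: log-likelihood ratio in observed/expected form.
g_stat <- function(k) {
  # Expected counts under independence: row total * column total / grand total.
  expected <- outer(rowSums(k), colSums(k)) / sum(k)
  # Cells with a zero count contribute nothing to the sum.
  terms <- ifelse(k > 0, k * log(k / expected), 0)
  2 * sum(terms)
}

print(g_stat(f(90)))  # observed counts match the expected counts
print(g_stat(f(10)))  # observed counts are far from the expected counts
```

Written this way it is easier to see why the score depends on how far the
table is from what chance predicts, and not on the raw size of the overlap.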

For reference, here is the R code for llr:

> llr
function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
> H
function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}
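
Putting the pieces together, the script below is a self-contained sketch
that goes from the tables to similarity values. One assumption is not stated
anywhere in this thread: that the similarity is derived from the score as
1 - 1/(1 + LLR). It does reproduce the 0.9971 reported in the original
question, but treat it as an assumption. llr_similarity is a hypothetical
name, and the max(..., 0) only guards against the round-off seen above.

```r
# Contingency table as defined earlier in this message (filled column by column).
f <- function(k11, n1 = 100, n2 = 900000, n = 1e6) {
  matrix(c(k11, n1 - k11, n2 - k11, n - n1 - n2 + k11), nrow = 2)
}

# sum(p * log(p)) over the cells; (k == 0) makes empty cells contribute 0.
H <- function(k) {
  N <- sum(k)
  sum(k / N * log(k / N + (k == 0)))
}

# Log-likelihood ratio score, as in the reference code above.
llr <- function(k) {
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}

# ASSUMPTION: similarity = 1 - 1 / (1 + LLR), with tiny negative scores clamped to 0.
llr_similarity <- function(k) {
  score <- max(llr(k), 0)
  1 - 1 / (1 + score)
}

print(llr_similarity(f(90)))
print(llr_similarity(f(10)))
```

Note that this table is the transpose of the k11..k22 layout in the quoted
message below (n1 here is the 100-item user), which does not change the score.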


On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <
[email protected]> wrote:

> Hi Dmitriy,
>
> I am following the same calculation used in the userSimilarity method in
> LogLikelihoodSimilarity.java
>
> k11 = intersectionSize       (both users viewed movie)
>
> k12 = prefs2Size - intersectionSize   (only viewed by user 2)
>
> k21 = prefs1Size - intersectionSize    (only viewed by user 1)
>
> k22 = numItems- prefs1Size - prefs2Size + intersectionSize  (not viewed by
> both 1 and 2)
>
>
> Thanks,
>
> Aishwarya
>
> On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > how do you compute k11, k12... values exactly?
> >
> > On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have the following case where numItems = 1,000,000, prefs1Size =
> > > 900,000 and prefs2Size = 100.
> > >
> > > It is the case when i have two users, one who has seen 90% of the
> > > movies in the database and another only 100 of the million movies.
> > > Suppose they have 90 movies in common (user 2 has seen only 100 movies
> > > totally), i would assume the similarity to be high compared to when
> > > they have only 10 movies in common. But the similarities i am getting are
> > > 0.9971 for intersection size 10 and
> > > 0 for intersection size 90.
> > >
> > > This seems counter intuitive.
> > >
> > > Am i missing something? Is there an explanation for the above mentioned
> > > values?
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/LogLikelihoodSimilarity-calculation-tp4158035.html
> > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > >
> >
>
