Actually, this is a great example of why LLR works. Intuitively speaking, what do you know about the taste of someone who prefers 90% of items? Nothing; they watch them all. What value are the co-occurrences in their movie watches? None. In fact, I'd look to see whether pref1 isn't an anomaly caused by a crawler or something. Intuitively speaking, LLR found the non-differentiating user and properly ignored them.
On Sep 10, 2014, at 8:43 PM, Ted Dunning <[email protected]> wrote:

It might help to look at the matrices that result. First I defined a function in R to generate the contingency tables:

> f = function(k11, n1=100, n2=900000, n=1e6) {matrix(c(k11, n1-k11, n2-k11, n-n1-n2+k11), nrow=2)}

One of your examples is this one:

> f(90)
     [,1]   [,2]
[1,]   90 899910
[2,]   10  99990

Notice how the two columns are basically the same except for a scaling factor. Here is your other example:

> f(10)
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

Now what we have is that in the first column, row 2 is bigger, while in the second column, row 1 is bigger. That is, the distributions are quite different. Here is the actual LLR score for the first example:

> llr(f(90))
[1] -1.275022e-10

(The negative sign is spurious, the result of round-off error. The real result is basically just 0.)

And for the second:

> llr(f(10))
[1] 351.6271

Here we see a huge value, which says that (as we saw) the distributions are different. For reference, here is the R code for llr:

> llr
function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
> H
function(k) {N = sum(k); return (sum(k/N * log(k/N + (k==0))))}

On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <[email protected]> wrote:

> Hi Dmitriy,
>
> I am following the same calculation used in the userSimilarity method in
> LogLikelihoodSimilarity.java:
>
> k11 = intersectionSize (both users viewed the movie)
> k12 = prefs2Size - intersectionSize (viewed only by user 2)
> k21 = prefs1Size - intersectionSize (viewed only by user 1)
> k22 = numItems - prefs1Size - prefs2Size + intersectionSize (viewed by
> neither user)
>
> Thanks,
> Aishwarya
>
> On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>> How do you compute the k11, k12... values exactly?
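Ted's R session can be reproduced outside R as well. Here is a sketch in Python (my transcription of his `f`, `H`, and `llr` definitions using numpy; the function names follow his R code, everything else is an assumption for illustration):

```python
import numpy as np

def f(k11, n1=100, n2=900000, n=1_000_000):
    """Build the 2x2 contingency table, laid out as in the R matrix()
    call (column-major: column 1 is (k11, n1-k11))."""
    return np.array([[k11,      n2 - k11],
                     [n1 - k11, n - n1 - n2 + k11]], dtype=float)

def H(k):
    # sum of p*log(p), with 0*log(0) treated as 0
    # (the same trick as the R expression k/N + (k==0))
    N = k.sum()
    p = k / N
    return np.sum(p * np.log(p + (p == 0)))

def llr(k):
    # 2 * N * (H(k) - H(row sums) - H(column sums))
    return 2 * k.sum() * (H(k) - H(k.sum(axis=1)) - H(k.sum(axis=0)))

print(llr(f(90)))  # essentially 0, up to round-off: the columns are proportional
print(llr(f(10)))  # ~351.63: the two columns have very different distributions
```

The two printed scores should match the R outputs above (0 and 351.6271) to floating-point precision.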
>>
>> On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have the following case where numItems = 1,000,000, prefs1Size =
>>> 900,000, and prefs2Size = 100.
>>>
>>> It is the case when I have two users, one who has seen 90% of the movies
>>> in the database and another only 100 of the million movies. Suppose they
>>> have 90 movies in common (user 2 has seen only 100 movies in total); I
>>> would assume the similarity to be high compared to when they have only
>>> 10 movies in common. But the similarities I am getting are
>>> 0.9971 for intersection size 10 and
>>> 0 for intersection size 90.
>>>
>>> This seems counterintuitive.
>>>
>>> Am I missing something? Is there an explanation for the above-mentioned
>>> values?
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/LogLikelihoodSimilarity-calculation-tp4158035.html
>>> Sent from the Mahout User List mailing list archive at Nabble.com.
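For anyone following along: the k11..k22 construction quoted above plugs straight into a table like Ted's f. A sketch in Python with the numbers from the original question, intersection size 90 (the variable names are mine, not Mahout's):

```python
num_items   = 1_000_000
prefs1_size = 900_000   # user 1 has seen 90% of the catalog
prefs2_size = 100       # user 2 has seen 100 movies
intersection = 90       # movies both users have seen

k11 = intersection                                          # seen by both
k12 = prefs2_size - intersection                            # only user 2
k21 = prefs1_size - intersection                            # only user 1
k22 = num_items - prefs1_size - prefs2_size + intersection  # seen by neither

# The four cells partition the catalog: every item is counted exactly once.
assert k11 + k12 + k21 + k22 == num_items

# Both columns split 9:1 between user 1's seen and unseen items -- the
# exact independence pattern that makes LLR report ~0 for this case.
print(k11, k21)  # 90 899910
print(k12, k22)  # 10 99990
```

That 9:1 proportionality in both columns is why the intersection-90 case scores 0: knowing what user 2 watched tells you nothing beyond user 1's base rate.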
