Aishwarya,

The two matrices in question are this one for overlap == 10,

     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

and this one for overlap == 20

     [,1]   [,2]
[1,]   20 899980
[2,]   80  99920

Column 1 here represents prefs1, column 2 represents not(prefs1), row 1
represents prefs2, row2 represents not(prefs2).

As you can see, in the first case the two columns trend in opposite
directions: the first column's top element is the smaller of its two, while
the second column's top element is the larger.  This is a highly unusual
situation for cooccurrence, of course, because most things are actually
rather rare.

So what is happening here is that as you move from an intersection of 10 to
an intersection of 20, the distributions in column 1 and column 2 become
*more* alike than before.  Thus the score drops from the
ultra-super-hyper-mega-massive score of 350 to the merely
ultra-super-hyper-mega score of 270.
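If you want to check these numbers yourself, here is a quick Python sketch
of the standard G-squared form of the log-likelihood ratio (the helper name
`llr` is just for illustration, not anything from Mahout):

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Computed as 2 * sum(k * ln(k * N / (row_total * col_total))),
    i.e. observed counts against the counts expected under independence.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    total = 0.0
    for k, r, c in [(k11, row1, col1), (k12, row1, col2),
                    (k21, row2, col1), (k22, row2, col2)]:
        if k > 0:  # a zero cell contributes nothing
            total += k * math.log(k * n / (r * c))
    return 2 * total

# overlap == 10: the matrix above, read row by row
print(llr(10, 899990, 90, 99910))   # ~351.63
# overlap == 20
print(llr(20, 899980, 80, 99920))   # ~272.60
```

So the scores really do fall as the two columns become more alike.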

A more interesting example is one where you have two items which each occur
only 100 times out of the million in total.  Now if these two co-occur 10
times, we get this:

     [,1]   [,2]
[1,]   10     90
[2,]   90 999810

which gives an LLR score of about 120.  This is a huge score because two
events which each occur 1/10,000 of the time occur in the presence of the
other at a rate of 10%.  This is 1000x lift.

For a cooccurrence of 20, we get this matrix:

     [,1]   [,2]
[1,]   20     90
[2,]   90 999800

And now the LLR goes to 264.  The lift in frequency is now 2000x so the
score is much higher.
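As a quick check on both of these rare-item scores, the same sort of sketch
works (the `llr` helper below is the same standard G-squared computation,
repeated so this snippet runs on its own):

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    total = 0.0
    for k, r, c in [(k11, row1, col1), (k12, row1, col2),
                    (k21, row2, col1), (k22, row2, col2)]:
        if k > 0:
            total += k * math.log(k * n / (r * c))
    return 2 * total

# two items, ~100 occurrences each out of a million, co-occurring 10 times
print(llr(10, 90, 90, 999810))   # ~120
# the same items co-occurring 20 times
print(llr(20, 90, 90, 999800))   # ~264
```

Doubling the cooccurrence of two rare items more than doubles the score,
which is the opposite of what happens in the first pair of matrices.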

Does this help?




On Fri, Sep 26, 2014 at 11:55 AM, Aishwarya Srivastava <
aishwarya.srivast...@bloomreach.com> wrote:

> Hi Ted & Pat,
>
> Thank you so much for your answers. I can see now why it makes sense that
> in my first case (intersection size 90), LLR returns zero.
>
> But with the second case i still don't understand one thing. To restate the
> problem,
> I have the following case where numItems = 1,000,000, prefs1Size =
> 900,000, prefs2Size
> = 100 and intersection size is 10.
> The matrix that Ted generated is
>      [,1]   [,2]
> [1,]   10 899990
> [2,]   90  99910
>
> The LLR for this case is 351.6270674569532. I understand this.
>
> But if the users had so little in common, is this not dissimilarity?
>
> In Mahout today i am getting a score of 0.9971.
>
> If however the intersection size is 20, then the values are 272.6 for
> LLR and Mahout's similarity score is 0.9963.
>
> I get how the LLR should decrease. But is an intersection of 20 not more
> similar than an intersection of 10.
>
> Sorry if i am missing something very obvious.
>
> Thanks,
> Aishwarya
>
>
> On Tue, Sep 23, 2014 at 3:03 AM, <mario.al...@gmail.com> wrote:
>
> > > Can you correct me if so?
> >
> > This makes me doubt about the correctness of my point:) The best is
> > possibly to write down a small example, with numbers and formulae. It's
> the
> > best way either to see if I missed the point, or there is actually a
> subtle
> > truth....
> >
> > I'll soon get back to the group with something less abstract -thanks
> again!
> >
> > On Sun, Sep 21, 2014 at 10:06 PM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > On Fri, Sep 19, 2014 at 3:29 AM, <mario.al...@gmail.com> wrote:
> > >
> > > > So my question was -shouldn't we consider both the frequency
> > distribution
> > > > of item sales *and* of users purchases in the same formula? Am I
> > correct
> > > if
> > > > I say that this does not happen when we compute the contingency table
> > (if
> > > > we build the contingency table for two users, we do not consider the
> > > > frequency distribution of book sales, and vice versa), right?
> > > >
> > > > That said, I am fully aware that mine is a mainly academic question,
> as
> > > the
> > > > LLR makes anyway a fantastic job....!
> > > >
> > >
> > > As I understand it, I believe that LLR does what you want since it
> knows
> > > the overall frequency of the user and the item in question.
> > >
> > > What it does not do directly is include information about how *other*
> > users
> > > and *other* items are distributed except in aggregate.
> > >
> > > On the other hand, when you rank these LLR scores for a single user,
> you
> > do
> > > incorporate evidence from all other items (relative to that single
> user).
> > >
> > > I think that your point is actually quite subtle and I may have missed
> > the
> > > point.  Can you correct me if so?
> > >
> >
>