Hi Ted & Pat,
Thank you so much for your answers. I can see now why it makes sense that
in my first case (intersection size 90), LLR returns zero.
But in the second case I still don't understand one thing. To restate the
problem: I have numItems = 1,000,000, prefs1Size = 900,000, prefs2Size = 100,
and an intersection size of 10.
The matrix that Ted generated is
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910
The LLR for this case is 351.6270674569532. I understand this.
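For reference, here is a small Python sketch of the entropy form of the LLR, which, as far as I can tell, is what Mahout's LogLikelihood class computes (the helper names here are my own):

```python
import math

def x_log_x(x):
    # x * ln(x), with the convention 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy of a set of counts:
    # x_log_x(total) minus the sum of x_log_x over the counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# The table above: intersection size 10
print(llr(10, 899990, 90, 99910))   # ~351.627

# My first case: intersection size 90
print(llr(90, 899910, 10, 99990))   # ~0.0
```

The intersection-90 case comes out as zero because that table matches independence exactly: the expected co-occurrence is 900,000 * 100 / 1,000,000 = 90.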
But if the users had so little in common, doesn't that indicate
dissimilarity? In Mahout today I am getting a score of 0.9971.
If, however, the intersection size is 20, then the LLR is 272.6 and
Mahout's similarity score is 0.9963. I get why the LLR should decrease.
But isn't an intersection of 20 more similar than an intersection of 10?
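If I am reading Mahout's LogLikelihoodSimilarity right (this is my assumption about its internals), it squashes the unbounded LLR into [0, 1) like this, which would explain why both of my scores sit so close to 1:

```python
def llr_to_similarity(llr_score):
    # Assumed Mahout LogLikelihoodSimilarity mapping: squash a
    # non-negative, unbounded LLR into [0, 1).
    # A larger LLR always gives a score closer to 1.
    return 1.0 - 1.0 / (1.0 + llr_score)

print(llr_to_similarity(351.6270674569532))  # ~0.99716 (intersection 10)
print(llr_to_similarity(272.6))              # ~0.99635 (intersection 20)
```

So the mapping is monotone in the LLR: a bigger LLR always means a bigger similarity score. What I suspect I may be missing is that the LLR measures surprise relative to independence in either direction: with prefs1Size = 900,000 the expected intersection is 90, so an intersection of 10 is far *below* chance, and 20 is less surprising than 10, not more.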
Sorry if I am missing something very obvious.
Thanks,
Aishwarya
On Tue, Sep 23, 2014 at 3:03 AM, <[email protected]> wrote:
> > Can you correct me if so?
>
> This makes me doubt the correctness of my point :) The best thing is
> probably to write down a small example, with numbers and formulae. That
> is the best way to see whether I missed the point or whether there is
> actually a subtle truth...
>
> I'll soon get back to the group with something less abstract. Thanks again!
>
> On Sun, Sep 21, 2014 at 10:06 PM, Ted Dunning <[email protected]>
> wrote:
>
> > On Fri, Sep 19, 2014 at 3:29 AM, <[email protected]> wrote:
> >
> > > So my question was: shouldn't we consider both the frequency
> > > distribution of item sales *and* of users' purchases in the same
> > > formula? Am I correct in saying that this does not happen when we
> > > compute the contingency table (if we build the contingency table for
> > > two users, we do not consider the frequency distribution of book
> > > sales, and vice versa)?
> > >
> > > That said, I am fully aware that mine is a mainly academic question,
> > > as the LLR does a fantastic job anyway...!
> > >
> >
> > As I understand it, I believe that LLR does what you want, since it
> > knows the overall frequency of the user and the item in question.
> >
> > What it does not do directly is include information about how *other*
> > users and *other* items are distributed, except in aggregate.
> >
> > On the other hand, when you rank these LLR scores for a single user,
> > you do incorporate evidence from all other items (relative to that
> > single user).
> >
> > I think that your point is actually quite subtle and I may have
> > missed the point. Can you correct me if so?
> >
>