Awesome! Thanks for clarifying! :)
On Thu, Jun 20, 2013 at 12:28 PM, Sean Owen <[email protected]> wrote:

> Yes, that should be all that's needed.
>
> On Jun 20, 2013 10:27 AM, "Dan Filimon" <[email protected]> wrote:
>
> > Right, makes sense. So, by normalize, I need to replace the counts in the
> > matrix with probabilities.
> > So, I would divide everything by the sum of all the counts in the matrix?
> >
> > On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <[email protected]> wrote:
> >
> > > I think the quickest answer is: the formula computes the test
> > > statistic as a difference of log values, rather than the log of a ratio
> > > of values. By not normalizing, the entropy is multiplied by a factor
> > > (the sum of the counts) vs. normalized. So you do end up with a
> > > statistic N times larger when counts are N times larger.
> > >
> > > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon <[email protected]> wrote:
> > >
> > > > My understanding:
> > > >
> > > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> > > > distribution with 1 degree of freedom in the 2x2 table case:
> > > >
> > > >          A    ~A
> > > >    B
> > > >   ~B
> > > >
> > > > We're testing to see if p(A | B) = p(A | ~B). That's the null
> > > > hypothesis. I compute the LLR. The larger that is, the more unlikely
> > > > the null hypothesis is to be true.
> > > > I can then look at a table with df=1, and I'd get p, the probability
> > > > of seeing that result or something worse (the upper tail).
> > > > So, the probability of them being similar is 1 - p (which is exactly
> > > > the CDF for that value of X).
> > > >
> > > > Now, my question is: in the contingency table case, why would I
> > > > normalize? It's a ratio already, isn't it?
> > > >
> > > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[email protected]> wrote:
> > > >
> > > > > Someone can check my facts here, but the log-likelihood ratio
> > > > > follows a chi-squared distribution. You can figure an actual
> > > > > probability from that in the usual way, from its CDF. You would need
> > > > > to tweak the code you see in the project to compute an actual LLR by
> > > > > normalizing the input.
> > > > >
> > > > > You could use 1 - p then as a similarity metric.
> > > > >
> > > > > This also isn't how the test statistic is turned into a similarity
> > > > > metric in the project now. But 1 - p sounds nicer. Maybe the
> > > > > historical reason was speed, or ignorance.
> > > > >
> > > > > On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon <[email protected]> wrote:
> > > > >
> > > > > > When computing item-item similarity using the log-likelihood
> > > > > > similarity [1], can I simply apply a sigmoid to the resulting
> > > > > > values to get the probability that two items are similar?
> > > > > >
> > > > > > Is there any other processing I need to do?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html
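
For reference, here is a minimal sketch of the recipe the thread converges on: build the 2x2 contingency table, compute Dunning's LLR (-2 log lambda) via the entropy form (which normalizes the counts by their sum, as discussed above), and map it to a similarity as 1 - p using the chi-squared CDF with df=1. This is a Python/SciPy illustration, not the Mahout implementation, and the function names and example counts are made up for the sketch.

```python
import numpy as np
from scipy.stats import chi2


def neg_entropy(counts):
    """sum(p * log p) over counts normalized to probabilities (0 log 0 := 0)."""
    k = np.asarray(counts, dtype=float)
    p = k / k.sum()
    return float(np.sum(p * np.log(np.where(p > 0, p, 1.0))))


def llr_2x2(k11, k12, k21, k22):
    """-2 log lambda for the 2x2 table [[k11, k12], [k21, k22]] (entropy form)."""
    k = np.array([[k11, k12], [k21, k22]], dtype=float)
    n = k.sum()
    # Dividing by the total count inside neg_entropy is the normalization step
    # from the thread; multiplying by 2 * n rescales back to the test statistic.
    return 2.0 * n * (neg_entropy(k)
                      - neg_entropy(k.sum(axis=1))   # row sums
                      - neg_entropy(k.sum(axis=0)))  # column sums


def llr_similarity(k11, k12, k21, k22):
    """1 - p: the chi-squared(df=1) CDF evaluated at the LLR statistic."""
    return chi2.cdf(llr_2x2(k11, k12, k21, k22), df=1)


# Hypothetical counts: items co-occur 10 times, item A alone appears with 120
# other events, item B alone with 30, and 10000 events involve neither.
print(llr_similarity(10, 120, 30, 10000))
```

Note that, as Sean points out above, this 1 - p mapping is not how the project turns the statistic into a similarity today; the sketch only illustrates the normalized-LLR-plus-CDF route discussed in the thread.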
