Thanks for the reference! I'll take a look at chapter 7, but let me first
describe what I'm trying to achieve.

I'm trying to identify interesting pairs (anomalous co-occurrences) using
the LLR. I'm doing this on a day's worth of data, and I want to keep the
p-values. I then want to use those p-values to compute an overall
probability over the course of multiple days, to increase confidence in
what I think are the interesting pairs.
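
For concreteness, here is roughly the shape of what I have in mind (a rough
sketch only; the chi-squared conversion and Fisher's method for combining
the per-day p-values are assumptions on my part, and the scores at the end
are made up):

    from scipy.stats import chi2, combine_pvalues

    def llr_to_pvalue(llr_score):
        # Upper tail of chi-squared with df=1: the probability of seeing a
        # statistic at least this large under the null hypothesis.
        return chi2.sf(llr_score, df=1)

    def combined_pvalue(daily_llr_scores):
        # Fisher's method: fold one p-value per day into a single p-value
        # for the pair over the whole period.
        p_values = [llr_to_pvalue(s) for s in daily_llr_scores]
        _, p_combined = combine_pvalues(p_values, method='fisher')
        return p_combined

    # The same pair scored on three consecutive days (made-up numbers).
    print(combined_pvalue([7.9, 11.3, 5.2]))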



On Fri, Jun 21, 2013 at 1:10 AM, Ted Dunning <[email protected]> wrote:

> I think that this is a really bad thing to do.
>
> The LLR is really good for finding interesting things.  Once you have done
> that, directly using the LLR in any form to produce a weight reduces the
> method to something akin to Naive Bayes.  This is bad generally and very,
> very bad in the case of small counts.
>
> Typically LLR works extremely well when you use it as a filter only and
> then use some global measure to compute a weight.  See the Luduan method [1]
> for an example.  The use of a text retrieval engine to implement a search
> engine such as I have been lately nattering about much too much is another
> example.    A major reason that such methods work so unreasonably well is
> that they don't make silly weighting decisions based on very small counts.
>  It is slightly paradoxical that looking at global counts rather than
> counts specific to the cases of interest produces much better weights, but
> the empirical evidence is pretty overwhelming.
>
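(Just to make sure I follow the filter-then-weight idea, here is a rough
sketch of my reading of it, where the LLR only decides whether a pair is
kept and the weight comes from some global measure; the threshold and the
IDF-style weight below are placeholders I made up, not the Luduan method
itself:)

    import math

    LLR_THRESHOLD = 10.0   # made-up cutoff, would need tuning

    def filter_then_weight(llr_scores, item_count, num_users):
        # llr_scores: {(item_a, item_b): llr}; item_count: how many users
        # interacted with each item globally; num_users: total users.
        weighted = {}
        for (item_a, item_b), score in llr_scores.items():
            if score <= LLR_THRESHOLD:
                continue   # the LLR is used only as a keep/drop filter
            # The weight comes from a global measure (IDF as a stand-in
            # here), not from the LLR score itself.
            weighted[(item_a, item_b)] = math.log(
                num_users / (1.0 + item_count[item_b]))
        return weighted
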
> Aside from such practical considerations, there is the fact that converting
> a massive number of frequentist p-values into weights is either outright
> heresy (from the frequentist point of view) or simply nutty (from the
> Bayesian point of view).
>
> In any case, I have never been able to get more than one bit of useful
> information from an LLR score.  That one bit is extremely powerful, but
> getting more seems to be a very bad idea.
>
>
> [1] http://arxiv.org/abs/1207.1847, chapter 7 especially
>
>
> On Thu, Jun 20, 2013 at 10:41 AM, Dan Filimon
> <[email protected]>wrote:
>
> > Awesome! Thanks for clarifying! :)
> >
> >
> > On Thu, Jun 20, 2013 at 12:28 PM, Sean Owen <[email protected]> wrote:
> >
> > > Yes that should be all that's needed.
> > > On Jun 20, 2013 10:27 AM, "Dan Filimon" <[email protected]>
> > > wrote:
> > >
> > > > Right, makes sense. So, by normalize, I need to replace the counts in
> > > > the matrix with probabilities. So, I would divide everything by the
> > > > sum of all the counts in the matrix?
> > > >
> > > >
> > > > On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <[email protected]> wrote:
> > > >
> > > > > I think the quickest answer is: the formula computes the test
> > > > > statistic as a difference of log values, rather than the log of a
> > > > > ratio of values. By not normalizing, the entropy is multiplied by a
> > > > > factor (the sum of the counts) vs. normalized. So you do end up with
> > > > > a statistic N times larger when the counts are N times larger.
> > > > >
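(A quick numeric check of that scaling, using SciPy's G-test, which is the
same statistic; the table values are made up, and the continuity correction
is switched off so nothing else interferes:)

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[13, 1000],
                      [1000, 100000]])

    res_1x = chi2_contingency(table, lambda_="log-likelihood",
                              correction=False)
    res_10x = chi2_contingency(10 * table, lambda_="log-likelihood",
                               correction=False)

    # Same proportions, 10x the counts, so the statistic is exactly 10x.
    print(res_1x[0], res_10x[0])
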
> > > > > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon
> > > > > <[email protected]> wrote:
> > > > > > My understanding:
> > > > > >
> > > > > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> > > > > > distribution with 1 degree of freedom in the 2x2 table case:
> > > > > >
> > > > > >          A     ~A
> > > > > >    B    k11    k12
> > > > > >   ~B    k21    k22
> > > > > >
> > > > > > We're testing to see if p(A | B) = p(A | ~B). That's the null
> > > > > > hypothesis. I compute the LLR. The larger that is, the more
> > > > > > unlikely the null hypothesis is to be true.
> > > > > > I can then look it up in a chi-squared table with df=1 and get p,
> > > > > > the probability of seeing that result or something more extreme
> > > > > > (the upper tail). So, the probability of them being similar is
> > > > > > 1 - p (which is exactly the CDF for that value of the statistic).
> > > > > >
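(In code, my understanding of the above is roughly the sketch below; the
llr_2x2 helper is just the raw-count G statistic for the k11..k22 table,
and I'm assuming SciPy is available for the chi-squared tail:)

    import math
    from scipy.stats import chi2

    def xlogx(x):
        return 0.0 if x == 0 else x * math.log(x)

    def llr_2x2(k11, k12, k21, k22):
        # Raw-count G statistic for the 2x2 table, i.e. 2 * N * (mutual
        # information of the table); the counts go in as they are.
        n = k11 + k12 + k21 + k22
        rows = xlogx(k11 + k12) + xlogx(k21 + k22)
        cols = xlogx(k11 + k21) + xlogx(k12 + k22)
        cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
        return 2.0 * (cells - rows - cols + xlogx(n))

    def similarity(k11, k12, k21, k22):
        llr = llr_2x2(k11, k12, k21, k22)
        p = chi2.sf(llr, df=1)   # upper tail: P(X >= llr) under the null
        return 1.0 - p           # the CDF at llr, used as the similarity
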
> > > > > > Now, my question is: in the contingency table case, why would I
> > > > > > normalize? It's a ratio already, isn't it?
> > > > > >
> > > > > >
> > > > > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[email protected]> wrote:
> > > > > >
> > > > > >> Someone can check my facts here, but the log-likelihood ratio
> > > > > >> follows a chi-square distribution. You can figure an actual
> > > > > >> probability from that in the usual way, from its CDF. You would
> > > > > >> need to tweak the code you see in the project to compute an
> > > > > >> actual LLR by normalizing the input.
> > > > > >>
> > > > > >> You could use 1-p then as a similarity metric.
> > > > > >>
> > > > > >> This also isn't how the test statistic is turned into a similarity
> > > > > >> metric in the project now. But 1-p sounds nicer. Maybe the
> > > > > >> historical reason was speed, or, ignorance.
> > > > > >>
> > > > > >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon
> > > > > >> <[email protected]> wrote:
> > > > > >> > When computing item-item similarity using the log-likelihood
> > > > > >> > similarity [1], can I simply apply a sigmoid to the resulting
> > > > > >> > values to get the probability that two items are similar?
> > > > > >> >
> > > > > >> > Is there any other processing I need to do?
> > > > > >> >
> > > > > >> > Thanks!
> > > > > >> >
> > > > > >> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html
> > > > > >>
> > > > >
> > > >
> > >
> >
>
