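Isn't this just a matter of comparing the frequency of "the" with that of "the the"? If "the" makes up 1/n of the words, then under independence "the the" ought to make up about 1/n^2 of the adjacent word pairs. If the observed share is lower than that, the pair is under-represented; if it is higher, it is over-represented.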
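As a rough illustration of that comparison, here is a minimal Python sketch. The function name `representation_ratio` and the toy sentence are made up for this example, and it assumes the text is already tokenized into a list of words. It returns observed/expected, so a value below 1 suggests under-representation and a value above 1 suggests over-representation.

```python
from collections import Counter


def representation_ratio(tokens, word):
    """Observed / expected share of the bigram (word, word) under independence.

    Hypothetical helper for illustration: `tokens` is a list of words.
    """
    n_words = len(tokens)
    n_bigrams = n_words - 1
    if n_bigrams < 1:
        raise ValueError("need at least two tokens")

    # Share of single words that are `word` (the 1/n in the text above).
    p_word = tokens.count(word) / n_words
    if p_word == 0:
        raise ValueError("word does not occur in tokens")

    # Share of adjacent pairs that are (word, word).
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_observed = bigrams[(word, word)] / n_bigrams

    # Under independence the pair should occur with probability p_word ** 2.
    p_expected = p_word ** 2
    return p_observed / p_expected


tokens = "the cat saw the the dog and the bird".split()
ratio = representation_ratio(tokens, "the")
print("under-represented" if ratio < 1 else "not under-represented")
```

The ratio says nothing about significance; with small counts it will be noisy, which is where a test such as LLR comes in.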
On Thu, Jun 21, 2012 at 9:01 PM, Nimrod Priell <[email protected]> wrote:
> I am wondering if there's a way to detect whether the deviation from
> independence is of the type that the co-occurrence is under-represented or
> over-represented w.r.t. random sampling. Ideally, I'd like a measure on, say,
> (-inf, inf) where if the result is negative there is under-representation of
> the class where both A and B occur, and if it is positive, there is an
> overabundance of samples with (A intersection B).
>
> My initial guess was that LLR(k_11, k_12, k_21, k_22) has one minimum with
> respect to k_11, i.e. keeping all other parameters fixed, it will be
> decreasing with k_11 up to a point, then increasing. That minimum is
> obviously when the co-occurrence is random.
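For the signed measure asked about in the quoted message, one common construction is to take the square root of the LLR and attach a sign according to whether k_11 is below or above what independence would predict. The sketch below is only an illustration of that idea, not something proposed in this thread; the function names are invented, and it assumes k_11, k_12, k_21, k_22 are the non-negative cell counts of the 2x2 contingency table with non-empty row and column totals.

```python
import math


def _x_log_x(x):
    # Convention: 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy: N log N - sum(k log k), with N = sum(counts).
    return _x_log_x(sum(counts)) - sum(_x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def signed_root_llr(k11, k12, k21, k22):
    """sqrt(LLR), negative when k11 is below its expected value."""
    root = math.sqrt(llr(k11, k12, k21, k22))
    # k11 < (row1 total * col1 total) / N is equivalent to k11*k22 < k12*k21.
    if k11 * k22 < k12 * k21:
        root = -root
    return root


print(signed_root_llr(1, 19, 19, 1))    # negative: A and B rarely co-occur
print(signed_root_llr(19, 1, 1, 19))    # positive: A and B co-occur often
```

With the other three counts held fixed, this value crosses zero at the point where k_11 * k_22 = k_12 * k_21, which is the independence point described in the quoted message.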
