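Is this not just a matter of comparing the observed frequency of "the the"
with what the frequency of "the" predicts? If "the" is 1/n of the words,
then under independence "the the" ought to be about 1/n^2 of the word pairs.
If it's less, it's under-represented; if it's more, it's over-represented.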
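
To get a single signed number out of that comparison, one option is to take
the square root of the LLR and give it the sign of (observed k_11 minus
expected k_11). Here is a minimal Python sketch of that idea. The function
names (x_log_x, entropy, llr, signed_root_llr) are my own for illustration,
and k_11..k_22 are assumed to be the usual 2x2 contingency counts, with k_11
the number of samples where both A and B occur.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy: N*log(N) - sum(k*log(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # G^2 statistic for a 2x2 table; always >= 0, so it carries no direction.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def signed_root_llr(k11, k12, k21, k22):
    # Expected k11 under independence: row total * column total / N.
    n = k11 + k12 + k21 + k22
    expected = (k11 + k12) * (k11 + k21) / n
    root = math.sqrt(llr(k11, k12, k21, k22))
    # Positive: A-and-B is over-represented; negative: under-represented.
    return root if k11 >= expected else -root
```

The result lives on (-inf, inf) and is zero when k_11 matches what
independence predicts, which is the kind of measure asked for below.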

On Thu, Jun 21, 2012 at 9:01 PM, Nimrod Priell <[email protected]> wrote:
> I am wondering if there's a way to detect whether the deviation from 
> independence is of the type that the co-occurrence is under-represented or
> over-represented w.r.t. random sampling. Ideally, I'd like a measure on, say,
> (-inf, inf) where if the result is negative there is under-representation of 
> the class where both A and B occur, and if it is positive, there is an 
> overabundance of samples with (A intersection B).
>
> My initial guess was that LLR(k_11, k_12, k_21, k_22) has one minimum with
> respect to k_11, i.e. keeping all other parameters fixed, it will be
> decreasing with k_11 up to a point, then increasing. That minimum is
> obviously when the co-occurrence is random.
