Hi,
I've been looking at log-likelihood ratio to rank statistically significant
co-occurence of items for users, after reading about it first in "Mahout in
Action", then reviewing the code, and links, and Ted Dunning's original paper
on ACL 93'.
I've reproduced Ted's results from the paper, (using the R code in the blog
post), in R:
> m
the swiss OTHER
the 0 110 2442
at 76 0 104
OTHER 2476 111 26458
(This is a 2-line example from the results on page 12 (72 in original
publication) of the paper)
Now,
> to_llr(m)
the swiss OTHER
the 446.26 270.7219 80.975
at 157.21 2.5196 145.986
OTHER 143.11 256.7802 14.745
The results for "the swiss" and "at the" correspond to those appearing in the
paper.
However, note the result for "the the", which accurately presents this odd
bigram as deviating greatly from independence: Given how common "the" is as a
unigram, it's highly improbable to never see it as a bigram.
I am wondering if there's a way to detect whether the deviation from
independence is of the type that the co-occurrance is under-represented or
over-represented w.r.t random sampling. Ideally, I'd like a measure on, say,
(-inf, inf) where if the result is negative there is under-representation of
the class where both A and B occur, and if it is positive, there is an
overabundance of samples with (A intersection B).
My initial guess was that LLR(k_11, k_12, k_21, k_22) has one minima with
respect to k_11, i.e. keeping all other parameters fixed, it will be decreasing
with k_11 up to a point, then increasing. That minimum is obviously when the
co-occurance is random.
Hence my thought was using d_LLR/dk_11 to determine direction: Concretely,
looking at the sign of LLR(k_11 + 1, k_12, k_21, k_22) and using that for the
direction. But I bet there are smarter ways than that.
Ted- on twitter (I am @nimrodpriell) you asked me to e-mail the list and
perhaps your full dissertation has more pointers on the subject. So I would
love it if you can point me to it.
I did note Lingpipe uses a different type of scoring, Pearson C_2 goodness of
fit (it seems different from LLR, but I didn't dig deep) to do their
collocation scoring:
http://alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html (the
exact method is documented in the code,
http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/TokenizedLM.html#chiSquaredIndependence(int[])
). Is that method a good way to capture what I'd like?
---
On a completely different subject: I wrote a simple RelevantItemsDataSplitter
and RecommenderIRStatsEvaluator which take a list of item IDs, and run CF
evaluation by hiding items only out of that list, and asking to recommend only
out of that list of items (precision and recall are then also calculated only
with that list of items as the universe).
The use-case is when you have an entire user-item matrix (think binary CF, not
rating-based for now) but only certain items are interesting in an actual
recommendation scenario: For instance, you only want to recommend products from
a specific store, but you think that knowledge of the products from other
stores would help (hence the bigger matrix). It helps because recommending just
by popularity is very good and could beat user-item or item-item CF in cases
where there is a power-law distribution and most the users have a small set of
items. But in a real situation you often don't want to recommend what 80% of
the users already have (even if a user is in the 20% that doesn't have that
item). You want results on items you're likely to recommend in a real scenario;
That are less popular for example.
I realize an alternative to the example I proposed with the popularity is
looking at the top-n recommendation for large n because only relatively few
items are very popular so the precision-recall stats based on popularity become
less skewed; But I still think it's a useful constraint for evaluation.
If there is interest I will make sure it is well tested, read the guidelines
and try to contribute the code (it's really <5 lines of code that change). Or
was there a better way to do this rather than overloading functions on the RIDS
and RIRSE classes?
Thanks,
Nimrod Priell