Hi,

I've been looking at log-likelihood ratio to rank statistically significant 
co-occurence of items for users, after reading about it first in "Mahout in 
Action", then reviewing the code, and links, and Ted Dunning's original paper 
on ACL 93'.

I've reproduced Ted's results from the paper, (using the R code in the blog 
post), in R:
> m
       the swiss OTHER
the      0   110  2442
at      76     0   104
OTHER 2476   111 26458

(This is a 2-line example from the results on page 12 (72 in original 
publication) of the paper)
Now, 

> to_llr(m)
         the    swiss   OTHER
the   446.26 270.7219  80.975
at    157.21   2.5196 145.986
OTHER 143.11 256.7802  14.745

The results for "the swiss" and "at the" correspond to those appearing in the 
paper.

However, note the result for "the the", which accurately presents this odd 
bigram as deviating greatly from independence: Given how common "the" is as a 
unigram, it's highly improbable to never see it as a bigram.

I am wondering if there's a way to detect whether the deviation from 
independence is of the type that the co-occurrance is under-represented or 
over-represented w.r.t random sampling. Ideally, I'd like a measure on, say, 
(-inf, inf) where if the result is negative there is under-representation of 
the class where both A and B occur, and if it is positive, there is an 
overabundance of samples with (A intersection B). 

My initial guess was that LLR(k_11, k_12, k_21, k_22) has one minima with 
respect to k_11, i.e. keeping all other parameters fixed, it will be decreasing 
with k_11 up to a point, then increasing. That minimum is obviously when the 
co-occurance is random.

Hence my thought was using d_LLR/dk_11 to determine direction: Concretely, 
looking at the sign of LLR(k_11 + 1, k_12, k_21, k_22) and using that for the 
direction. But I bet there are smarter ways than that.

Ted- on twitter (I am @nimrodpriell) you asked me to e-mail the list and 
perhaps your full dissertation has more pointers on the subject. So I would 
love it if you can point me to it. 

I did note Lingpipe uses a different type of scoring, Pearson C_2 goodness of 
fit (it seems different from LLR, but I didn't dig deep) to do their 
collocation scoring: 
http://alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html (the 
exact method is documented in the code, 
http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/TokenizedLM.html#chiSquaredIndependence(int[])
 ). Is that method a good way to capture what I'd like?

---

On a completely different subject: I wrote a simple RelevantItemsDataSplitter 
and RecommenderIRStatsEvaluator which take a list of item IDs, and run CF 
evaluation by hiding items only out of that list, and asking to recommend only 
out of that list of items (precision and recall are then also calculated only 
with that list of items as the universe).

The use-case is when you have an entire user-item matrix (think binary CF, not 
rating-based for now) but only certain items are interesting in an actual 
recommendation scenario: For instance, you only want to recommend products from 
a specific store, but you think that knowledge of the products from other 
stores would help (hence the bigger matrix). It helps because recommending just 
by popularity is very good and could beat user-item or item-item CF in cases 
where there is a power-law distribution and most the users have a small set of 
items. But in a real situation you often don't want to recommend what 80% of 
the users already have (even if a user is in the 20% that doesn't have that 
item). You want results on items you're likely to recommend in a real scenario; 
That are less popular for example.

I realize an alternative to the example I proposed with the popularity is 
looking at the top-n recommendation for large n because only relatively few 
items are very popular so the precision-recall stats based on popularity become 
less skewed; But I still think it's a useful constraint for evaluation. 

If there is interest I will make sure it is well tested, read the guidelines 
and try to contribute the code (it's really <5 lines of code that change). Or 
was there a better way to do this rather than overloading functions on the RIDS 
and RIRSE classes?

Thanks,
Nimrod Priell

Reply via email to