Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Dan Filimon
Thanks for the reference! I'll take a look at chapter 7, but let me first describe what I'm trying to achieve. I'm trying to identify interesting pairs (anomalous co-occurrences) with the LLR. I'm doing this for a day's data and I want to keep the p-values. I then want to use the p-values to

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon dangeorge.fili...@gmail.com wrote: Thanks for the reference! I'll take a look at chapter 7, but let me first describe what I'm trying to achieve. I'm trying to identify interesting pairs, the anomalous co-occurrences with the LLR. I'm doing this

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Dan Filimon
The thing is, there's no real model for which these are features. I'm looking for pairs of similar items (and eventually groups). I'd like a probabilistic interpretation of how similar two items are, something like: what is the probability that a user who likes one item will also like the other? Then,

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
Well, you are still stuck with the problem that pulling more bits out of small-count data is a bad idea. Most of the models that I am partial to never even honestly estimate probabilities. They just include or exclude features and then weight rare features higher than common ones. This is easy

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Dan Filimon
Could you be more explicit? What models are these, and how do I use them to track how similar two items are? I'm essentially working with a custom-tailored RowSimilarityJob after first filtering out users with too many items.

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
On Fri, Jun 21, 2013 at 10:59 AM, Dan Filimon dangeorge.fili...@gmail.com wrote: Could you be more explicit? What models are these, how do I use them to track how similar two items are? Luduan document classification. Recommendation systems. Adaptive search engines. The question of how

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Sebastian Schelter
Not that it much matters, I tend to filter out user x item entries based on the item *and* the user prevalence. This gives me a nicely bounded number of occurrences for every user and every item. I'd be interested in implementing this. Can you share a few more details? Having another pass
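Sebastian's idea of bounding the number of occurrences per user and per item can be sketched roughly as follows. This is an illustrative assumption, not the actual Mahout or RowSimilarityJob code: the cap values, the pre-shuffle, and the single greedy pass are all choices made for the sketch.

```python
import random

def downsample(interactions, max_per_user=500, max_per_item=500, seed=42):
    """Greedy single-pass downsampling sketch: drop a (user, item) pair
    once either its user or its item has reached its cap. Caps and the
    one-pass greedy strategy are illustrative, not Mahout's actual code."""
    rng = random.Random(seed)
    interactions = list(interactions)
    rng.shuffle(interactions)  # avoid input-order bias before capping
    user_counts, item_counts, kept = {}, {}, []
    for user, item in interactions:
        if (user_counts.get(user, 0) < max_per_user
                and item_counts.get(item, 0) < max_per_item):
            user_counts[user] = user_counts.get(user, 0) + 1
            item_counts[item] = item_counts.get(item, 0) + 1
            kept.append((user, item))
    return kept

# Dense toy data: 20 users x 20 items, capped at 5 occurrences each.
pairs = [(u, i) for u in range(20) for i in range(20)]
kept = downsample(pairs, max_per_user=5, max_per_item=5)
```

After this pass every user and every item appears a bounded number of times, which is what keeps the downstream co-occurrence counts bounded.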

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
See https://github.com/tdunning/in-memory-cooccurrence for an in-memory implementation. Should just require three or so lines of code.

Log-likelihood ratio test as a probability

2013-06-20 Thread Dan Filimon
When computing item-item similarity using the log-likelihood similarity [1], can I simply apply a sigmoid to the resulting values to get the probability that two items are similar? Is there any other processing I need to do? Thanks! [1]

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
Someone can check my facts here, but the log-likelihood ratio follows a chi-squared distribution. You can figure an actual probability from that in the usual way, from its CDF. You would need to tweak the code you see in the project to compute an actual LLR by normalizing the input. You could use

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Dan Filimon
My understanding: yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared distribution with 1 degree of freedom in the 2x2 contingency table case (cells A/~A crossed with B/~B). We're testing to see if p(A | B) = p(A | ~B); that's the null hypothesis. I compute the LLR. The larger that is, the more unlikely
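This reading can be checked numerically. Below is a minimal sketch (not Mahout's actual code, though it follows the same entropy-based formulation of G^2): compute the LLR for a 2x2 table and convert it to a p-value via the chi-squared survival function, which for 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def x_log_x(x: float) -> float:
    """x * log(x), with the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts: float) -> float:
    """Unnormalized entropy term: N*log(N) - sum(x*log(x))."""
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11: float, k12: float, k21: float, k22: float) -> float:
    """G^2 = -2 log lambda for a 2x2 contingency table
    (k11 = co-occurrences of A and B, etc.)."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def p_value(g2: float) -> float:
    """Upper-tail probability of chi-squared with 1 degree of freedom;
    for df = 1 the survival function reduces to erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(g2 / 2.0))

# A and B co-occur far more often than independence would predict,
# so the LLR is large and the p-value is tiny.
g2 = llr(100, 1000, 1000, 100000)
print(g2, p_value(g2))
```

Under the null hypothesis of independence the statistic is near zero and the p-value near one; the familiar 0.05 threshold sits at G^2 of about 3.84.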

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
I think the quickest answer is: the formula computes the test statistic as a difference of log values, rather than log of ratio of values. By not normalizing, the entropy is multiplied by a factor (sum of the counts) vs normalized. So you do end up with a statistic N times larger when counts are N

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Dan Filimon
Right, makes sense. So, by "normalize", I need to replace the counts in the matrix with probabilities. So, I would divide everything by the sum of all the counts in the matrix?

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
Yes, that should be all that's needed.
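Sean's scaling point is easy to verify: with raw counts the statistic grows linearly in the grand total N, so multiplying every count by the same factor multiplies G^2 by that factor, while dividing G^2 by N (equivalent to feeding in probabilities) leaves a scale-free quantity, twice the mutual information. A small sketch, illustrative rather than project code:

```python
import math

def g2(k11, k12, k21, k22):
    """Raw-count log-likelihood ratio statistic for a 2x2 table:
    G^2 = 2 * sum over cells of k * log(k * N / (rowsum * colsum))."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    return 2.0 * sum(k * math.log(k * n / (rows[i] * cols[j]))
                     for k, i, j in cells if k > 0)

raw = g2(30, 70, 70, 830)            # N = 1000
scaled = g2(300, 700, 700, 8300)     # same proportions, N = 10000
# scaled == 10 * raw: every cell ratio k*N/(r*c) is unchanged,
# so only the leading counts (hence N) scale the statistic.
# raw / 1000 == scaled / 10000: dividing by N removes the count
# dependence, leaving twice the mutual information of the table.
print(raw, scaled, raw / 1000)
```

So "normalizing" by the total count converts the test statistic into a per-observation association strength, at the cost of no longer being the quantity the chi-squared test applies to.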

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Dan Filimon
Awesome! Thanks for clarifying! :)

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Ted Dunning
I think that this is a really bad thing to do. The LLR is really good for finding interesting things. Once you have done that, directly using the LLR in any form to produce a weight reduces the method to something akin to Naive Bayes. This is bad generally and very, very bad in the case of small