Thanks for the reference! I'll take a look at chapter 7, but let me first
describe what I'm trying to achieve.
I'm trying to identify interesting pairs, the anomalous co-occurrences with
the LLR. I'm doing this for a day's data and I want to keep the p-values.
I then want to use the p-values to [...]
On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon dangeorge.fili...@gmail.com wrote:
> Thanks for the reference! I'll take a look at chapter 7, but let me first
> describe what I'm trying to achieve.
> I'm trying to identify interesting pairs, the anomalous co-occurrences with
> the LLR. I'm doing this [...]
> The thing is there's no real model for which these are features.
> I'm looking for pairs of similar items (and eventually groups). I'd like a
> probabilistic interpretation of how similar two items are. Something like
> "what is the probability that a user that likes one will also like the
> other?".
> Then, [...]
Well, you are still stuck with the problem that pulling more bits out of
the small count data is a bad idea.
Most of the models that I am partial to never even honestly estimate
probabilities. They just include or exclude features and then weight rare
features higher than common ones.
This is easy [...]
Could you be more explicit?
What models are these, how do I use them to track how similar two items are?
I'm essentially working with a custom-tailored RowSimilarityJob after first
filtering out users with too many items.
On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning ted.dunn...@gmail.com wrote: [...]
On Fri, Jun 21, 2013 at 10:59 AM, Dan Filimon
dangeorge.fili...@gmail.com wrote:
> Could you be more explicit?
> What models are these, how do I use them to track how similar two items are?
Luduan document classification.
Recommendation systems.
Adaptive search engines.
The question of how [...]
Not that it much matters, I tend to filter out user x item entries based on
the item *and* the user prevalence. This gives me a nicely bounded number
of occurrences for every user and every item.
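A rough sketch of that kind of prevalence-based downsampling might look like the following. The caps and the probabilistic keep rule here are my own illustration, not Ted's actual code:

```python
import random
from collections import Counter

# Hypothetical caps -- illustrative values, not from any real implementation.
MAX_ITEMS_PER_USER = 500
MAX_USERS_PER_ITEM = 1000

def downsample(interactions, seed=42):
    """Filter (user, item) pairs so that both every user and every item
    ends up with a roughly bounded number of occurrences."""
    rng = random.Random(seed)
    user_counts = Counter(u for u, _ in interactions)
    item_counts = Counter(i for _, i in interactions)
    kept = []
    for user, item in interactions:
        # Keep each entry with probability inversely proportional to how
        # far its user and its item exceed their caps; entries belonging
        # to rare users and rare items are always kept.
        p_user = min(1.0, MAX_ITEMS_PER_USER / user_counts[user])
        p_item = min(1.0, MAX_USERS_PER_ITEM / item_counts[item])
        if rng.random() < p_user * p_item:
            kept.append((user, item))
    return kept
```

Entries for users and items below the caps pass through untouched, so only the overly prolific users and overly popular items are thinned out.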
I'd be interested in implementing this. Can you share a few more
details? Having another pass [...]
See https://github.com/tdunning/in-memory-cooccurrence for an in-memory
implementation.
Should just require three or so lines of code.
On Fri, Jun 21, 2013 at 11:23 AM, Sebastian Schelter s...@apache.org wrote:
> Not that it much matters, I tend to filter out user x item entries based on [...]
When computing item-item similarity using the log-likelihood similarity
[1], can I simply apply a sigmoid to the resulting values to get the
probability that two items are similar?
Is there any other processing I need to do?
Thanks!
[1]
Someone can check my facts here, but the log-likelihood ratio follows
a chi-square distribution. You can figure an actual probability from
that in the usual way, from its CDF. You would need to tweak the code
you see in the project to compute an actual LLR by normalizing the
input.
You could use [...]
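Sean's suggestion can be sketched in a few lines. This is a minimal Python version of my own (not code from the thread) of the 2x2 LLR in the entropy form that, to my knowledge, Mahout's LogLikelihood class uses, plus the chi-squared tail probability with one degree of freedom, which has the closed form erfc(sqrt(x/2)):

```python
from math import log, sqrt, erfc

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy: N * H(counts / N), with N = sum of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (-2 log lambda) for the 2x2 contingency table:
              A     ~A
        B    k11   k12
        ~B   k21   k22
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def p_value(score):
    # Chi-squared survival function with 1 degree of freedom:
    # sf(x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(score / 2.0))
```

For example, an independent table like llr(5, 5, 5, 5) gives 0, i.e. a p-value of 1, while a perfectly dependent llr(10, 0, 0, 10) gives about 27.7, a p-value well under 1e-6.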
My understanding:
Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
distribution with 1 degree of freedom in the 2x2 table case:

       A    ~A
  B
  ~B

We're testing to see if p(A | B) = p(A | ~B). That's the null hypothesis. I
compute the LLR. The larger that is, the more unlikely the null hypothesis
is.
I think the quickest answer is: the formula computes the test
statistic as a difference of log values, rather than log of ratio of
values. By not normalizing, the entropy is multiplied by a factor (the sum
of the counts) versus the normalized case. So you do end up with a statistic
N times larger when the counts are N times larger.
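Sean's point can be checked numerically with the standard G-test form of the statistic (a sketch of mine, not code from the thread): multiplying every count in the table by a factor c multiplies the statistic by exactly c.

```python
from math import log

def g2(k11, k12, k21, k22):
    # G-test / LLR statistic: 2 * sum over cells of
    # k_ij * ln(k_ij * N / (row_i * col_j)), skipping empty cells.
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    return 2.0 * sum(k * log(k * n / (rows[r] * cols[c]))
                     for k, r, c in cells if k > 0)

base = g2(20, 10, 15, 55)
# Scaling every count by 10 scales the statistic by 10: the k * n term in
# the log picks up c^2 while row * col picks up c^2, so only the outer
# k factor changes.
scaled = g2(200, 100, 150, 550)
```

This is why the same table expressed as raw counts versus as probabilities (counts divided by N) gives statistics that differ by a factor of N.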
Right, makes sense. So, by "normalize", you mean I should replace the counts
in the matrix with probabilities.
So, I would divide everything by the sum of all the counts in the matrix?
On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen sro...@gmail.com wrote:
> I think the quickest answer is: the formula [...]
Yes that should be all that's needed.
On Jun 20, 2013 10:27 AM, Dan Filimon dangeorge.fili...@gmail.com wrote:
> Right, makes sense. So, by normalize, I need to replace the counts in the
> matrix with probabilities.
> So, I would divide everything by the sum of all the counts in the matrix?
Awesome! Thanks for clarifying! :)
On Thu, Jun 20, 2013 at 12:28 PM, Sean Owen sro...@gmail.com wrote:
> Yes that should be all that's needed.
> On Jun 20, 2013 10:27 AM, Dan Filimon dangeorge.fili...@gmail.com wrote:
> > Right, makes sense. So, by normalize, I need to replace the counts in the [...]
I think that this is a really bad thing to do.
The LLR is really good to find interesting things. Once you have done
that, directly using the LLR in any form to produce a weight reduces the
method to something akin to Naive Bayes. This is bad generally and very,
very bad in the case of small counts.