On Fri, Jun 21, 2013 at 10:59 AM, Dan Filimon
<[email protected]>wrote:

> Could you be more explicit?
> What models are these, how do I use them to track how similar two items
> are?
>

Luduan document classification.

Recommendation systems.

Adaptive search engines.

The question of exactly how similar two items are is much harder to attack than
the question of roughly which items are very similar.  You can deal with the
most related items, but in the mid-range even the ordering is very fuzzy.

> I'm essentially working with a custom-tailored RowSimilarityJob after
> filtering out users with too many items first.
>

Not that it much matters, but I tend to filter out user x item entries based on
the item *and* the user prevalence.  This gives me a nicely bounded number
of occurrences for every user and every item.

If you don't want to count the item frequency in advance, then just
down-sampling crazy users is fine.

The reason that it doesn't much matter is that very few elements are
filtered out.
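
For concreteness, here is a minimal sketch of that kind of down-sampling
(illustrative class name and cap value, not the actual RowSimilarityJob code):
keep at most a fixed number of interactions per user via reservoir sampling, so
no single hyper-active user contributes an unbounded number of co-occurrences.

import java.util.*;

public class PerUserReservoir {
  private static final int MAX_ITEMS_PER_USER = 500;   // hypothetical cap
  private final Map<Long, List<Long>> sampledItems = new HashMap<>();
  private final Map<Long, Integer> seenCount = new HashMap<>();
  private final Random rand = new Random(42);

  public void observe(long userId, long itemId) {
    int seen = seenCount.merge(userId, 1, Integer::sum);
    List<Long> sample = sampledItems.computeIfAbsent(userId, u -> new ArrayList<>());
    if (sample.size() < MAX_ITEMS_PER_USER) {
      sample.add(itemId);                    // under the cap: always keep
    } else {
      int slot = rand.nextInt(seen);         // standard reservoir step
      if (slot < MAX_ITEMS_PER_USER) {
        sample.set(slot, itemId);            // replace a random earlier pick
      }
    }
  }

  public List<Long> itemsFor(long userId) {
    return sampledItems.getOrDefault(userId, Collections.emptyList());
  }
}

The same idea applies on the item side if you also want to bound item prevalence.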


>
> On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning <[email protected]>
> wrote:
>
> > Well, you are still stuck with the problem that pulling more bits out of
> > the small count data is a bad idea.
> >
> > Most of the models that I am partial to never even honestly estimate
> > probabilities.  They just include or exclude features and then weight
> > rare features higher than common.
> >
> > This is easy to do across days and very easy to have different days
> > contribute differently.
> >
> >
> >
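As a concrete sketch of that kind of weighting (the IDF-style formula below is
just one common choice, not a specific Mahout implementation):

// Illustrative only: once a feature has passed the include/exclude decision,
// give it an IDF-style weight so that rare features count more than common ones.
public final class FeatureWeights {
  private FeatureWeights() {}

  /**
   * @param totalDocs       total number of documents (or users) observed
   * @param docsWithFeature number of documents in which the feature occurs
   * @return a weight that grows as the feature gets rarer
   */
  public static double idfWeight(long totalDocs, long docsWithFeature) {
    return Math.log((double) (totalDocs + 1) / (docsWithFeature + 1));
  }
}
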
> > On Fri, Jun 21, 2013 at 10:13 AM, Dan Filimon
> > <[email protected]>wrote:
> >
> > > The thing is there's no real model for which these are features.
> > > I'm looking for pairs of similar items (and eventually groups).  I'd like a
> > > probabilistic interpretation of how similar two items are.  Something like
> > > "what is the probability that a user that likes one will also like the
> > > other?".
> > >
> > > Then, with these probabilities per day, I'd combine them over the course of
> > > multiple days by "pulling" the older probabilities towards 0.5:
> > > alpha * 0.5 + (1 - alpha) * p would be the linear approach to combining this, where
> > > alpha is 0 for the most recent day and larger for older ones. Then, I'd
> > > take the average of those estimates.
> > > The result would in my mind be a "smoothed" probability.
> > >
> > > Then, I'd get the top k per item from these.
> > >
> > >
> > >
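For what it's worth, the pull-towards-0.5 averaging described above would look
roughly like the sketch below (this is only an illustration of the proposed
scheme; the linear alpha schedule is an arbitrary choice):

// Sketch of the smoothing described above: each day's estimate p is pulled
// toward 0.5 by an age-dependent alpha, then the shrunken estimates are averaged.
public final class DailyProbabilitySmoother {
  private DailyProbabilitySmoother() {}

  /**
   * @param dailyP      per-day probability estimates, index 0 = most recent day
   * @param alphaPerDay how much to shrink toward 0.5 per day of age (e.g. 0.1)
   */
  public static double smoothed(double[] dailyP, double alphaPerDay) {
    double sum = 0;
    for (int age = 0; age < dailyP.length; age++) {
      double alpha = Math.min(1.0, age * alphaPerDay);  // 0 for today, larger for older days
      sum += alpha * 0.5 + (1 - alpha) * dailyP[age];
    }
    return sum / dailyP.length;
  }
}
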
> > > On Fri, Jun 21, 2013 at 11:45 AM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon
> > > > <[email protected]> wrote:
> > > >
> > > > > Thanks for the reference! I'll take a look at chapter 7, but let me first
> > > > > describe what I'm trying to achieve.
> > > > >
> > > > > I'm trying to identify interesting pairs, the anomalous co-occurrences with
> > > > > the LLR.  I'm doing this for a day's data and I want to keep the p-values.
> > > > > I then want to use the p-values to compute some overall probability over
> > > > > the course of multiple days to increase confidence in what I think are the
> > > > > interesting pairs.
> > > > >
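(For reference, the per-pair LLR score mentioned above comes from a 2x2
co-occurrence table; with Mahout's LogLikelihood utility the scoring step is
roughly the sketch below, with placeholder count names.)

import org.apache.mahout.math.stats.LogLikelihood;

// Sketch: score one candidate pair (A, B) from one day's data using the
// 2x2 co-occurrence counts. The count names are placeholders for whatever
// the job actually accumulates.
public final class PairScore {
  private PairScore() {}

  public static double llr(long bothAandB, long onlyA, long onlyB, long neither) {
    // k11 = users with both A and B, k12 = A but not B,
    // k21 = B but not A,             k22 = neither A nor B
    return LogLikelihood.logLikelihoodRatio(bothAandB, onlyA, onlyB, neither);
  }
}
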
> > > >
> > > > You can't reliably combine p-values this way (repeated comparisons and all
> > > > that).
> > > >
> > > > Also, in practice if you take the top 50-100 indicators of this sort, the
> > > > p-values will be so astronomically small that frequentist tests of
> > > > significance are ludicrous.
> > > >
> > > > That said, the assumptions underlying the tests are really a much bigger
> > > > problem.  The interesting problems of the world are often highly
> > > > non-stationary, which can lead to all kinds of problems in interpreting
> > > > these results.  What does it mean if something shows a 10^-20 p-value one
> > > > day and a 0.2 value the next?  Are you going to multiply them?  Or just say
> > > > that something isn't quite the same?  But how do you avoid comparing
> > > > p-values in this case, which is a famously bad practice?
> > > >
> > > > To my mind, the real problem here is that we are simply asking the wrong
> > > > question.  We shouldn't be asking about individual features.  We should be
> > > > asking about overall model performance.  You *can* measure real-world
> > > > performance and you *can* put error bars around that performance and you
> > > > *can* see changes and degradation in that performance.  All of those
> > > > comparisons are well-founded and work great.  Whether the model has
> > > > selected too many or too few variables really is a diagnostic matter that
> > > > has little to do with answering the question of whether the model is
> > > > working well.
> > > >
> > >
> >
>
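
One concrete way to put error bars around overall model performance, as
suggested above, is a bootstrap over per-example scores on held-out data
(a sketch with a caller-supplied metric, nothing model-specific):

import java.util.Arrays;
import java.util.Random;

// Sketch: bootstrap percentile interval for an overall metric such as
// precision@k on held-out interactions.
public final class BootstrapErrorBars {
  private BootstrapErrorBars() {}

  /** perExampleScore: e.g. 1.0 if the held-out item was in the top k, else 0.0. */
  public static double[] interval(double[] perExampleScore, int resamples, Random rand) {
    double[] means = new double[resamples];
    int n = perExampleScore.length;
    for (int r = 0; r < resamples; r++) {
      double sum = 0;
      for (int i = 0; i < n; i++) {
        sum += perExampleScore[rand.nextInt(n)];    // resample with replacement
      }
      means[r] = sum / n;
    }
    Arrays.sort(means);
    // central 95% interval of the resampled means
    return new double[] { means[(int) (0.025 * resamples)],
                          means[(int) (0.975 * resamples)] };
  }
}

Day-over-day drift in that interval is a far better-founded signal than
comparing per-feature p-values.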
