On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel <[email protected]> wrote:
> I think you meant: "Human relatedness decays much slower than item
> popularity."

Yes. Oops.

> So to make sure I understand the implications of using IDF… For
> boolean/implicit preferences the sum of all prefs (after weighting) for
> a single item over all users will always be 1 or 0, no matter whether
> the frequency is 1M or 1.

I don't see this. For things that occur once among N users, the sum is
log N. For items that occur for every user, the sum will be 0.

> Another approach would be to do some kind of outlier detection and
> remove those users.

Down-sampling and proper thresholding handle this. Crazy users and
crawlers are relatively rare and each gets only a single vote. This makes
them immaterial.

> Looking at some types of web data you will see crawlers as outliers
> mucking up impression or click-thru data.

You will see them, but they shouldn't matter.

> On Feb 2, 2013, at 1:25 PM, Ted Dunning <[email protected]> wrote:
>
> On Sat, Feb 2, 2013 at 1:03 PM, Pat Ferrel <[email protected]> wrote:
>
> > Indeed, please elaborate. Not sure what you mean by "this is an
> > important effect".
> >
> > Do you disagree with what I said re temporal decay?
>
> No. I agree with it. Human relatedness decays much more quickly than
> item popularity.
>
> I was extending this. Down-sampling should make use of this observation
> to try to preserve time coincidence in the resulting dataset.
>
> > As to downsampling, or rather reweighting outliers in popular items
> > and/or active users: it's another interesting question. Does the fact
> > that we both like puppies and motherhood make us in any real way
> > similar? I'm quite interested in ways to account for this. I've seen
> > what is done to normalize ratings from different users based on
> > whether they tend to rate high or low. I'm interested in any papers
> > talking about the super active user or super popular items.
> I view downsampling as a necessary evil when using cooccurrence based
> algorithms. This only applies to prolific users.
>
> For items, I tend to use simple IDF weightings. This gives very low
> weights to ubiquitous preferences.
>
> > Another subject of interest is the question: is it possible to create
> > a blend of recommenders based on their performance on long tail items?
>
> Absolutely this is possible and it is a great thing to do. Ensembles
> are all the rage and for good reason. See all the top players in the
> Netflix challenge.
>
> > For instance if the precision of a recommender (just considering the
> > item-item similarity for the present) as a function of item popularity
> > decreases towards the long tail, is it possible that one type of
> > recommender does better than another -- do the distributions cross?
> > This would suggest a blending strategy based on how far out the long
> > tail you are when calculating similar items.
>
> Yeah... but you can't tell very well due to the low counts.
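The IDF sums in the exchange above (a boolean preference held by one of
N users sums to log N, not 1; one held by every user sums to 0) can be
checked with a short sketch. The helper function and the value of N here
are illustrative only, not Mahout code:

```python
import math

def idf_column_sum(df, n_users):
    """Sum of IDF weights over all users for one item with boolean
    preferences, where df users expressed the preference.
    (Illustrative helper, not a Mahout API.)"""
    weight = math.log(n_users / df)  # IDF weight for this item
    return df * weight               # each of the df users contributes the weight once

N = 1000  # assumed number of users

# Item preferred by exactly one user: the column sums to log N, not 1.
assert abs(idf_column_sum(1, N) - math.log(N)) < 1e-9

# Item preferred by every user: the IDF weight, and hence the sum, is 0.
assert idf_column_sum(N, N) == 0.0
```

This is why "ubiquitous preferences" get very low weight: as df
approaches N the per-user weight log(N/df) goes to zero, regardless of
how large the raw count is.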
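The blending strategy Pat sketches (trust one recommender on popular
items, another out in the long tail) could look something like the
following. Everything here is hypothetical: the crossover point, the
linear handoff, and the score values are made up to illustrate the idea,
not measured from any dataset:

```python
def blend_score(score_a, score_b, popularity_pct, crossover=0.2):
    """Blend two recommenders' scores, trusting recommender A on popular
    items and recommender B in the long tail. popularity_pct is the
    item's popularity percentile in [0, 1], with 0 the deepest tail.
    All parameters are hypothetical, chosen only to illustrate the idea."""
    # alpha -> 1.0 for popular items, -> 0.0 deep in the tail
    alpha = min(1.0, popularity_pct / crossover)
    return alpha * score_a + (1.0 - alpha) * score_b

# Popular item: the blend is essentially A's score (alpha == 1.0).
popular = blend_score(0.9, 0.4, popularity_pct=0.8)

# Long-tail item: the blend leans toward B's score (alpha ~ 0.25).
tail = blend_score(0.9, 0.4, popularity_pct=0.05)
```

In practice the crossover would come from measuring each recommender's
precision as a function of item popularity, which is exactly where Ted's
caveat bites: the counts in the tail may be too low to estimate where,
or whether, the distributions cross.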
