On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel <[email protected]> wrote:
> I think you meant: "Human relatedness decays much slower than item
> popularity."

Yes. Oops.

> So to make sure I understand the implications of using IDF… For
> boolean/implicit preferences the sum of all prefs (after weighting) for
> a single item over all users will always be 1 or 0, no matter whether
> the frequency is 1M or 1.

I don't see this. For things that occur once among N users, the sum is
log N. For items that occur for every user, the sum will be 0.

> Another approach would be to do some kind of outlier detection and
> remove those users.

Down-sampling and proper thresholding handle this. Crazy users and
crawlers are relatively rare and each gets only a single vote. This makes
them immaterial.

> Looking at some types of web data you will see crawlers as outliers
> mucking up impression or click-thru data.

You will see them, but they shouldn't matter.

> On Feb 2, 2013, at 1:25 PM, Ted Dunning <[email protected]> wrote:
>
> On Sat, Feb 2, 2013 at 1:03 PM, Pat Ferrel <[email protected]> wrote:
>
> > Indeed, please elaborate. Not sure what you mean by "this is an
> > important effect".
> >
> > Do you disagree with what I said re temporal decay?
>
> No. I agree with it. Human relatedness decays much more quickly than
> item popularity.
>
> I was extending this. Down-sampling should make use of this observation
> to try to preserve time coincidence in the resulting dataset.
>
> > As to downsampling, or rather reweighting outliers in popular items
> > and/or active users: it's another interesting question. Does the fact
> > that we both like puppies and motherhood make us in any real way
> > similar? I'm quite interested in ways to account for this. I've seen
> > what is done to normalize ratings from different users based on
> > whether they tend to rate high or low. I'm interested in any papers
> > talking about the super active user or super popular items.
> I view downsampling as a necessary evil when using cooccurrence based
> algorithms. This only applies to prolific users.
>
> For items, I tend to use simple IDF weightings. This gives very low
> weights to ubiquitous preferences.
>
> > Another subject of interest is the question: is it possible to create
> > a blend of recommenders based on their performance on long tail items?
>
> Absolutely this is possible and it is a great thing to do. Ensembles
> are all the rage and for good reason. See all the top players in the
> Netflix challenge.
>
> > For instance if the precision of a recommender (just considering the
> > item-item similarity for the present) as a function of item popularity
> > decreases towards the long tail, is it possible that one type of
> > recommender does better than another -- do the distributions cross?
> > This would suggest a blending strategy based on how far out the long
> > tail you are when calculating similar items.
>
> Yeah... but you can't tell very well due to the low counts.
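The IDF sums in the exchange above (a boolean preference held by one of
N users sums to log N, not 1; one held by every user sums to 0) can be
checked with a short sketch. The helper function and the value of N here
are illustrative only, not Mahout code:

```python
import math

def idf_column_sum(df, n_users):
    """Sum of IDF weights over all users for one item with boolean
    preferences, where df users expressed the preference.
    (Illustrative helper, not a Mahout API.)"""
    weight = math.log(n_users / df)  # IDF weight for this item
    return df * weight               # each of the df users contributes the weight once

N = 1000  # assumed number of users

# Item preferred by exactly one user: the column sums to log N, not 1.
assert abs(idf_column_sum(1, N) - math.log(N)) < 1e-9

# Item preferred by every user: the IDF weight, and hence the sum, is 0.
assert idf_column_sum(N, N) == 0.0
```

This is why "ubiquitous preferences" get very low weight: as df
approaches N the per-user weight log(N/df) goes to zero, regardless of
how large the raw count is.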
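The blending strategy Pat sketches (trust one recommender on popular
items, another out in the long tail) could look something like the
following. Everything here is hypothetical: the crossover point, the
linear handoff, and the score values are made up to illustrate the idea,
not measured from any dataset:

```python
def blend_score(score_a, score_b, popularity_pct, crossover=0.2):
    """Blend two recommenders' scores, trusting recommender A on popular
    items and recommender B in the long tail. popularity_pct is the
    item's popularity percentile in [0, 1], with 0 the deepest tail.
    All parameters are hypothetical, chosen only to illustrate the idea."""
    # alpha -> 1.0 for popular items, -> 0.0 deep in the tail
    alpha = min(1.0, popularity_pct / crossover)
    return alpha * score_a + (1.0 - alpha) * score_b

# Popular item: the blend is essentially A's score (alpha == 1.0).
popular = blend_score(0.9, 0.4, popularity_pct=0.8)

# Long-tail item: the blend leans toward B's score (alpha ~ 0.25).
tail = blend_score(0.9, 0.4, popularity_pct=0.05)
```

In practice the crossover would come from measuring each recommender's
precision as a function of item popularity, which is exactly where Ted's
caveat bites: the counts in the tail may be too low to estimate where,
or whether, the distributions cross.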
