oops, forgot the log
So...
idf weighted preference value = item preference value * log(number of all
users / number of users with a preference for that item)
              items
          1    0    0
users     1    0    0
          1    1    0

freq           3    1    0
#users/freq    3/3  3/1  0  (no one prefers the third item, so it gets no weight)
So the idf weighted values are:

        1*log(1)    0           0
        1*log(1)    0           0
        1*log(1)    1*log(3)    0

sum     0           log(3)      0
So the IDF weighted matrix is (using log base 10, log(3) ≈ 0.48):

              items
          0    0       0
users     0    0       0
          0    0.48    0
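For anyone who wants to reproduce this, a minimal numpy sketch (my own
illustration, not Mahout code; log base 10 to match the 0.48, and an item
nobody prefers just keeps weight 0):

    import numpy as np

    # Boolean user-by-item preference matrix from the example above
    P = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [1, 1, 0]], dtype=float)

    n_users = P.shape[0]
    freq = P.sum(axis=0)  # users per item: [3, 1, 0]

    # IDF weight per item; freq == 0 gets weight 0 instead of a divide-by-zero
    idf = np.where(freq > 0, np.log10(n_users / np.maximum(freq, 1.0)), 0.0)

    weighted = P * idf
    print(weighted)  # only the [2, 1] entry is nonzero, at log10(3) ~= 0.477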
Weighting this way leaves no information for universally preferred items,
which is indeed what I was looking for. It looks like it should also work
for other values or explicit preferences--item prices, ratings, etc.

Intuition says this will result in a lower precision-related cross-validation
measure, since you are discounting the obvious recommendations. I have no
experience with measuring something like this; any experience you have would
be appreciated.
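To be concrete about what I'd measure, a rough precision-at-k sketch
(recommend() and held_out are hypothetical placeholders, not real APIs):

    def precision_at_k(recommended, held_out, k=10):
        # Fraction of the top-k recommendations found in the user's
        # held-out preferences
        hits = sum(1 for item in recommended[:k] if item in held_out)
        return hits / float(k)

    # Run once with raw preferences and once with IDF-weighted ones,
    # then compare the user-averaged scores:
    # raw = mean(precision_at_k(recommend(u, raw_prefs), held_out[u]) for u in users)
    # idf = mean(precision_at_k(recommend(u, idf_prefs), held_out[u]) for u in users)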
On Feb 5, 2013, at 12:33 PM, Ted Dunning <[email protected]> wrote:
On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel <[email protected]> wrote:
> I think you meant: "Human relatedness decays much slower than item
> popularity."
>
Yes. Oops.
> So to make sure I understand the implications of using IDF… For
> boolean/implicit preferences the sum of all prefs (after weighting) for a
> single item over all users will always be 1 or 0. This holds no matter
> whether the frequency is 1M or 1.
>
I don't see this.
For things that occur once for N users, the sum is log N. For items that
occur for every user, the sum will be 0.
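A two-line sanity check of that arithmetic (toy N of my choosing; natural
log here, though the base doesn't matter for the point):

    import math

    N = 1_000_000
    print(math.log(N / 1))      # preferred by one user: column sum = log N
    print(N * math.log(N / N))  # preferred by all users: column sum = 0.0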
> Another approach would be to do some kind of outlier detection and remove
> those users.
Down-sampling and proper thresholding handle this. Crazy users and
crawlers are relatively rare, and each gets only a single vote. This makes
them immaterial.
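For illustration, a minimal sketch of per-user down-sampling of this sort
(the cap and the uniform random choice are assumptions of mine, not
Mahout's exact scheme):

    import random

    MAX_PER_USER = 500  # hypothetical cap on interactions kept per user

    def downsample(interactions_by_user, cap=MAX_PER_USER, seed=42):
        # Keep at most `cap` interactions per user, chosen at random, so a
        # crawler with millions of clicks counts no more than any active user
        rng = random.Random(seed)
        return {user: rng.sample(items, cap) if len(items) > cap else list(items)
                for user, items in interactions_by_user.items()}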
> Looking at some types of web data you will see crawlers as outliers mucking
> up impression or click-thru data.
>
You will see them, but they shouldn't matter.
>
> On Feb 2, 2013, at 1:25 PM, Ted Dunning <[email protected]> wrote:
>
> On Sat, Feb 2, 2013 at 1:03 PM, Pat Ferrel <[email protected]> wrote:
>
>> Indeed, please elaborate. Not sure what you mean by "this is an important
>> effect"
>>
>> Do you disagree with what I said re temporal decay?
>>
>
> No. I agree with it. Human relatedness decays much more quickly than item
> popularity.
>
> I was extending this. Down-sampling should make use of this observation to
> try to preserve time coincidence in the resulting dataset.
>
>
>> As to downsampling, or rather reweighting outliers in popular items and/or
>> active users--it's another interesting question. Does the fact that we
>> both like puppies and motherhood make us in any real way similar? I'm
>> quite interested in ways to account for this. I've seen what is done to
>> normalize ratings from different users based on whether they tend to rate
>> high or low. I'm interested in any papers talking about the super active
>> user or super popular items.
>>
>
> I view downsampling as a necessary evil when using cooccurrence based
> algorithms. This only applies to prolific users.
>
> For items, I tend to use simple IDF weightings. This gives very low
> weights to ubiquitous preferences.
>
>
>
>>
>> Another subject of interest is the question: is it possible to create a
>> blend of recommenders based on their performance on long-tail items?
>
>
> Absolutely this is possible and it is a great thing to do. Ensembles are
> all the rage, and for good reason. See all the top players in the Netflix
> challenge.
>
>
>> For instance if the precision of a recommender (just considering the
>> item-item similarity for the present) as a function of item popularity
>> decreases towards the long tail, is it possible that one type of
>> recommender does better than another--do the distributions cross? This
>> would suggest a blending strategy based on how far out the long tail you
>> are when calculating similar items.
>
>
> Yeah... but you can't tell very well due to the low counts.