I think it is better to choose the test user's held-out ratings at random.
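
To make that concrete, here is a minimal standalone sketch (plain Java rather than the Mahout evaluator API; the item names, the 20% holdout fraction, and the fixed seed are just illustrative assumptions):

    import java.util.*;

    // Sketch: hold out a random subset of one user's ratings as the test
    // set, instead of holding out only the ratings above a threshold.
    public class RandomHoldout {
      public static void main(String[] args) {
        Map<String, Double> ratings = new LinkedHashMap<>();
        ratings.put("item1", 1.0);
        ratings.put("item2", 2.0);
        ratings.put("item3", 3.0);
        ratings.put("item4", 4.0);
        ratings.put("item5", 5.0);

        List<String> items = new ArrayList<>(ratings.keySet());
        Collections.shuffle(items, new Random(42)); // fixed seed, repeatable

        int testSize = Math.max(1, (int) (items.size() * 0.2)); // ~20% held out
        Set<String> testItems = new HashSet<>(items.subList(0, testSize));

        // The held-out items now span low and high ratings alike.
        for (Map.Entry<String, Double> e : ratings.entrySet()) {
          String side = testItems.contains(e.getKey()) ? "TEST " : "TRAIN";
          System.out.println(side + " " + e.getKey() + " = " + e.getValue());
        }
      }
    }

That way the held-out set isn't biased toward the items the user likes.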

On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <[email protected]> wrote:
> Yes. But: the test sample is small. Using 40% of your data to test is
> probably too much.
>
> My point is that it may be the least-bad thing to do. What test are you
> proposing instead, and why is it coherent with what you're testing?
>
>
>
>
> On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz
> <[email protected]> wrote:
>
>> But modeling a user only by his/her low ratings can be problematic, since
>> people are generally (I believe) more precise in their high ratings.
>> Another problem is that recommender algorithms in general first
>> mean-normalize the ratings for each user. Suppose that we have the
>> following ratings of three people (A, B, and C) on 5 items:
>>
>> A's ratings: 1 2 3 4 5
>> B's ratings: 1 3 5 2 4
>> C's ratings: 1 2 3 4 5
>>
>>
>> Suppose that A is the test user. Now, if we put only A's low ratings (1,
>> 2, and 3) into the training set and mean-normalize the ratings, then A
>> will come out more similar to B than to C, which is not true.
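>>
>> To see the effect numerically, here is a small standalone sketch (plain
>> Java; cosine over mean-centered vectors is just one plausible similarity
>> here, with A restricted to its three low ratings while B's and C's means
>> are taken over all five of their ratings):
>>
>>     public class NormalizationBias {
>>       // Cosine similarity of two equal-length vectors.
>>       static double cosine(double[] x, double[] y) {
>>         double dot = 0, nx = 0, ny = 0;
>>         for (int i = 0; i < x.length; i++) {
>>           dot += x[i] * y[i];
>>           nx += x[i] * x[i];
>>           ny += y[i] * y[i];
>>         }
>>         return dot / Math.sqrt(nx * ny);
>>       }
>>
>>       public static void main(String[] args) {
>>         // A's training data: only its low ratings on items 1-3, mean 2.
>>         double[] aN = {-1, 0, 1};
>>         // B: all ratings (1,3,5,2,4), mean 3; centered, on items 1-3.
>>         double[] bN = {-2, 0, 2};
>>         // C: all ratings (1,2,3,4,5), mean 3; centered, on items 1-3.
>>         double[] cN = {-2, -1, 0};
>>         System.out.println("sim(A,B) = " + cosine(aN, bN)); // 1.0
>>         System.out.println("sim(A,C) = " + cosine(aN, cN)); // ~0.63
>>       }
>>     }
>>
>> Even though A's full ratings are identical to C's, A comes out perfectly
>> similar to B once only its low ratings survive into the training set.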
>>
>>
>>
>>
>> ________________________________
>>  From: Sean Owen <[email protected]>
>> To: Mahout User List <[email protected]>; Ahmet Ylmaz
>> <[email protected]>
>> Sent: Saturday, February 16, 2013 8:41 PM
>> Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
>>
>> No, this is not a problem.
>>
>> Yes, it builds a model for each user, which takes a long time. It's
>> accurate but time-consuming, and it's meant for small data. You could
>> write your own test that holds out data for all test users at once;
>> that's what I did when I rewrote a lot of this, simply because it was
>> more useful to have larger tests.
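>>
>> As a rough standalone sketch of that batch variant (illustrative names
>> and toy data, not the code I actually wrote):
>>
>>     import java.util.*;
>>
>>     // Sketch: hold out ratings for all test users up front, so one
>>     // model can be built and reused for every test user.
>>     public class BatchHoldout {
>>       public static void main(String[] args) {
>>         Map<String, Map<String, Double>> data = new LinkedHashMap<>();
>>         data.put("u1", Map.of("i1", 5.0, "i2", 3.0));
>>         data.put("u2", Map.of("i1", 4.0, "i3", 2.0));
>>         data.put("u3", Map.of("i2", 1.0, "i3", 5.0));
>>
>>         Set<String> testUsers = Set.of("u1", "u3"); // chosen once
>>         Map<String, Map<String, Double>> train = new LinkedHashMap<>();
>>         Map<String, Map<String, Double>> heldOut = new LinkedHashMap<>();
>>
>>         Random rnd = new Random(1);
>>         for (Map.Entry<String, Map<String, Double>> u : data.entrySet()) {
>>           Map<String, Double> keep = new LinkedHashMap<>();
>>           Map<String, Double> hold = new LinkedHashMap<>();
>>           for (Map.Entry<String, Double> r : u.getValue().entrySet()) {
>>             // For test users, hold out a random half of their ratings.
>>             if (testUsers.contains(u.getKey()) && rnd.nextBoolean()) {
>>               hold.put(r.getKey(), r.getValue());
>>             } else {
>>               keep.put(r.getKey(), r.getValue());
>>             }
>>           }
>>           train.put(u.getKey(), keep);
>>           if (!hold.isEmpty()) heldOut.put(u.getKey(), hold);
>>         }
>>         // Build ONE model over `train`, then evaluate each user in
>>         // `heldOut` against it, instead of one model per test user.
>>         System.out.println("train    = " + train);
>>         System.out.println("held out = " + heldOut);
>>       }
>>     }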
>>
>> There are several ways to choose the test data. One common way is by
>> time, but there is no time information here by default. The problem with
>> that is, for example, that recent ratings may be low, or at least not
>> high. But the evaluation is of course asking the recommender for items
>> that are predicted to be highly rated. Random selection has the same
>> problem. Choosing by rating at least makes the test coherent.
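>>
>> Choosing by rating amounts to something like this minimal sketch
>> (standalone Java; the threshold value is just illustrative):
>>
>>     import java.util.*;
>>
>>     // Sketch: split one user's ratings at a relevance threshold, so
>>     // the test set holds exactly the items the user is known to like.
>>     public class ThresholdSplit {
>>       public static void main(String[] args) {
>>         Map<String, Double> ratings = Map.of(
>>             "i1", 1.0, "i2", 2.0, "i3", 3.0, "i4", 4.0, "i5", 5.0);
>>         double relevanceThreshold = 3.5; // illustrative value
>>
>>         Map<String, Double> train = new TreeMap<>();
>>         Map<String, Double> test = new TreeMap<>();
>>         for (Map.Entry<String, Double> e : ratings.entrySet()) {
>>           (e.getValue() >= relevanceThreshold ? test : train)
>>               .put(e.getKey(), e.getValue());
>>         }
>>         System.out.println("train (below threshold) = " + train);
>>         System.out.println("test (relevant items)   = " + test);
>>       }
>>     }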
>>
>> It does bias the training set, but the test set is supposed to be small.
>>
>> There is no way to actually know, a priori, what the top recommendations
>> are. You have no information to evaluate most recommendations. This makes a
>> precision/recall test fairly uninformative in practice. Still, it's better
>> than nothing and commonly understood.
>>
>> So while precision/recall won't be high on tests like this, for these
>> reasons, I don't get values this low for the MovieLens data with any
>> normal algorithm. You might, though, if you choose an algorithm or
>> parameters that don't work well.
>>
>>
>>
>>
>> On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz
>> <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
>> > code. I think that there are two important problems here.
>> >
>> > According to my understanding, the experimental protocol used in this
>> > code is something like this:
>> >
>> > It takes away a certain percentage of users as test users. For each test
>> > user it builds a training set consisting of the ratings given by all
>> > other users, plus the ratings of the test user which are below the
>> > relevanceThreshold. It then builds a model, makes a recommendation to
>> > the test user, and finds the intersection between this recommendation
>> > list and the items which are rated above the relevanceThreshold by the
>> > test user. It then calculates the precision and recall in the usual way.
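>> >
>> > In code, that final step would be roughly this (a standalone sketch,
>> > not the actual Mahout source):
>> >
>> >     import java.util.*;
>> >
>> >     // Sketch: precision/recall of a top-N recommendation list
>> >     // against the test user's relevant (highly rated) items.
>> >     public class IRStats {
>> >       public static void main(String[] args) {
>> >         Set<String> relevant = Set.of("i4", "i5"); // above threshold
>> >         List<String> recommended = List.of("i4", "i7", "i9"); // top-N
>> >
>> >         Set<String> hits = new HashSet<>(recommended);
>> >         hits.retainAll(relevant); // intersection
>> >
>> >         double precision = (double) hits.size() / recommended.size();
>> >         double recall = (double) hits.size() / relevant.size();
>> >         System.out.println("precision = " + precision); // 1/3
>> >         System.out.println("recall    = " + recall);    // 1/2
>> >       }
>> >     }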
>> >
>> > Problems:
>> > 1. (mild) It builds a model for every test user, which can take a lot of
>> > time.
>> >
>> > 2. (severe) Only the ratings (of the test user) which are below the
>> > relevanceThreshold are put into the training set. This means that the
>> > algorithm only knows the preferences of the test user for the items
>> > which s/he doesn't like. This is not a good representation of the
>> > user's ratings.
>> >
>> > Moreover, when I ran this evaluator on the MovieLens 1M data, the
>> > precision and recall turned out to be, respectively,
>> >
>> > 0.011534185658699288
>> > 0.007905982905982885
>> >
>> > and the run took about 13 minutes on my Intel Core i3. (I used
>> > user-based recommendation with k=2.)
>> >
>> >
>> > Although I know that it is not OK to judge the performance of a
>> > recommendation algorithm by these absolute precision and recall values
>> > alone, these numbers still seem too low to me, which might be the
>> > result of the second problem I mentioned above.
>> >
>> > Am I missing something?
>> >
>> > Thanks
>> > Ahmet
>> >
>>
