Yes. But: the test sample is small. Using 40% of your data to test is probably too much.
My point is that it may be the least-bad thing to do. What test are you
proposing instead, and why is it coherent with what you're testing?

On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz <[email protected]> wrote:

> But modeling a user only by his/her low ratings can be problematic, since
> people generally are more precise (I believe) in their high ratings.
>
> Another problem is that recommender algorithms in general first
> mean-normalize the ratings for each user. Suppose that we have the
> following ratings of 3 people (A, B, and C) on 5 items:
>
> A's ratings: 1 2 3 4 5
> B's ratings: 1 3 5 2 4
> C's ratings: 1 2 3 4 5
>
> Suppose that A is the test user. Now if we put only the low ratings of A
> (1, 2, and 3) into the training set and mean-normalize the ratings, then
> A will be more similar to B than to C, which is not true.
>
> ________________________________
> From: Sean Owen <[email protected]>
> To: Mahout User List <[email protected]>; Ahmet Ylmaz <[email protected]>
> Sent: Saturday, February 16, 2013 8:41 PM
> Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
>
> No, this is not a problem.
>
> Yes, it builds a model for each user, which takes a long time. It's
> accurate, but time-consuming. It's meant for small data. You could
> rewrite your own test to hold out data for all test users at once. That's
> what I did when I rewrote a lot of this, just because it was more useful
> to have larger tests.
>
> There are several ways to choose the test data. One common way is by
> time, but there is no time information here by default. The problem is
> that, for example, recent ratings may be low -- or at least not high
> ratings. But the evaluation is of course asking the recommender for items
> that are predicted to be highly rated. Random selection has the same
> problem. Choosing by rating at least makes the test coherent.
>
> It does bias the training set, but the test set is supposed to be small.
> There is no way to actually know, a priori, what the top recommendations
> are. You have no information to evaluate most recommendations. This makes
> a precision/recall test fairly uninformative in practice. Still, it's
> better than nothing and commonly understood.
>
> While precision/recall won't be high on tests like this, because of this,
> I don't get values this low for movielens data on any normal algo -- but
> you may, if choosing an algorithm or parameters that don't work well.
>
> On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <[email protected]> wrote:
>
> > Hi,
> >
> > I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> > code. I think that there are two important problems here.
> >
> > According to my understanding, the experimental protocol used in this
> > code is something like this:
> >
> > It takes away a certain percentage of users as test users. For each
> > test user it builds a training set consisting of the ratings given by
> > all other users, plus the ratings of the test user which are below the
> > relevanceThreshold. It then builds a model, makes a recommendation to
> > the test user, and finds the intersection between this recommendation
> > list and the items which are rated above the relevanceThreshold by the
> > test user. It then calculates the precision and recall in the usual
> > way.
> >
> > Problems:
> > 1. (mild) It builds a model for every test user, which can take a lot
> > of time.
> >
> > 2. (severe) Only the ratings (of the test user) which are below the
> > relevanceThreshold are put into the training set. This means that the
> > algorithm only knows the preferences of the test user about the items
> > which s/he doesn't like. This is not a good representation of user
> > ratings.
> > Moreover, when I run this evaluator on movielens 1m data, the
> > precision and recall turned out to be, respectively,
> >
> > 0.011534185658699288
> > 0.007905982905982885
> >
> > and the run took about 13 minutes on my Intel Core i3. (I used
> > user-based recommendation with k=2.)
> >
> > Although I know that it is not OK to judge the performance of a
> > recommendation algorithm by looking at these absolute precision and
> > recall values, these numbers still seem too low to me, which might be
> > the result of the second problem I mentioned above.
> >
> > Am I missing something?
> >
> > Thanks
> > Ahmet
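[Editor's note] The protocol discussed in this thread -- hold out a test user's ratings at or above the relevanceThreshold, ask the recommender for a top-N list, then score the overlap -- can be sketched without Mahout. This is a minimal illustration, not the RecommenderIRStatsEvaluator source: the toy ratings, the threshold value, and the stubbed recommendation list are all invented for the example.

```java
import java.util.*;

public class IRStatsSketch {

    // Precision@N: fraction of the recommended items that are relevant.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        if (recommended.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / recommended.size();
    }

    // Recall: fraction of the relevant items that were recommended.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        // One test user's ratings: itemId -> rating (toy data).
        Map<Long, Double> testUserRatings = Map.of(
            1L, 1.0, 2L, 2.0, 3L, 3.0, 4L, 4.0, 5L, 5.0);
        double relevanceThreshold = 3.5;

        // Split: ratings below the threshold stay in the training set;
        // ratings at or above it become the "relevant" set the
        // recommendations are judged against.
        Set<Long> relevant = new HashSet<>();
        Map<Long, Double> keptForTraining = new HashMap<>();
        for (Map.Entry<Long, Double> e : testUserRatings.entrySet()) {
            if (e.getValue() >= relevanceThreshold) {
                relevant.add(e.getKey());
            } else {
                keptForTraining.put(e.getKey(), e.getValue());
            }
        }

        // Pretend some recommender, trained on keptForTraining plus all
        // other users' ratings, returned this top-2 list.
        List<Long> recommended = List.of(4L, 6L);

        System.out.println(precision(recommended, relevant)); // 1 of 2 recommended are relevant -> 0.5
        System.out.println(recall(recommended, relevant));    // 1 of 2 relevant were recommended -> 0.5
    }
}
```

As the thread notes, with real data most of the items a recommender returns were never rated by the test user at all, so they count against precision even if the user would have liked them -- one reason the absolute numbers run low.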

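[Editor's note] Ahmet's normalization point can be checked numerically. The sketch below assumes mean-centering followed by cosine similarity over the co-rated items, as a stand-in for whatever similarity a user-based recommender actually uses; with A trained only on her low ratings (1, 2, 3), A scores as more similar to B than to C, even though A's full profile is identical to C's.

```java
import java.util.Arrays;

public class NormalizationBias {

    // Subtract the mean rating from every component of a rating vector.
    static double[] center(double[] v) {
        double mean = Arrays.stream(v).average().orElse(0.0);
        return Arrays.stream(v).map(x -> x - mean).toArray();
    }

    // Cosine similarity over the first n components (the co-rated items).
    static double cosine(double[] a, double[] b, int n) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < n; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Full profiles from the example in the thread; A and C are identical.
        double[] b = {1, 3, 5, 2, 4};
        double[] c = {1, 2, 3, 4, 5};

        // Only A's low ratings (1, 2, 3 on the first three items) reach the
        // training set, so A is mean-centered over those alone.
        double[] aTrain = center(new double[] {1, 2, 3}); // -> [-1, 0, 1]
        double[] bNorm  = center(b);                      // mean 3 -> [-2, 0, 2, -1, 1]
        double[] cNorm  = center(c);                      // mean 3 -> [-2, -1, 0, 1, 2]

        // Similarity is computed over the three items A co-rated.
        double simAB = cosine(aTrain, bNorm, 3); // 1.000
        double simAC = cosine(aTrain, cNorm, 3); // ~0.632
        System.out.printf("sim(A,B)=%.3f sim(A,C)=%.3f%n", simAB, simAC);
        // A now looks more like B than like C, which is the bias Ahmet
        // describes: A's full rating vector is the same as C's.
    }
}
```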