I think it is better to choose the held-out ratings of the test user at random; see the sketch below.
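To make that concrete, here is a minimal sketch of a random holdout for one test user (plain Java; the Rating and Split records are hypothetical stand-ins for illustration, not Mahout types):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomHoldout {

    // Hypothetical stand-ins for illustration; not Mahout types.
    record Rating(long itemId, float value) {}
    record Split(List<Rating> training, List<Rating> heldOut) {}

    // Hold out roughly `heldOutFraction` of one test user's ratings, chosen
    // uniformly at random and ignoring the rating values -- in contrast to
    // the relevanceThreshold-based split discussed below. The training part
    // would be merged back with all other users' ratings.
    static Split randomSplit(List<Rating> userRatings, double heldOutFraction, long seed) {
        List<Rating> shuffled = new ArrayList<>(userRatings);
        Collections.shuffle(shuffled, new Random(seed));
        int k = (int) Math.round(heldOutFraction * shuffled.size());
        return new Split(
            new ArrayList<>(shuffled.subList(k, shuffled.size())),
            new ArrayList<>(shuffled.subList(0, k)));
    }
}

Note Sean's caveat below still applies: randomly held-out items are often not highly rated, so a top-N evaluation may count them as misses even when the recommendations are reasonable.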
On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <[email protected]> wrote:
> Yes. But: the test sample is small. Using 40% of your data to test is
> probably too much.
>
> My point is that it may be the least-bad thing to do. What test are you
> proposing instead, and why is it coherent with what you're testing?
>
> On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz
> <[email protected]> wrote:
>
>> But modeling a user only by his/her low ratings can be problematic, since
>> people generally are more precise (I believe) in their high ratings.
>> Another problem is that recommender algorithms in general first
>> mean-normalize the ratings for each user. Suppose that we have the
>> following ratings of 3 people (A, B, and C) on 5 items:
>>
>> A's ratings: 1 2 3 4 5
>> B's ratings: 1 3 5 2 4
>> C's ratings: 1 2 3 4 5
>>
>> Suppose that A is the test user. Now if we put only the low ratings of A
>> (1, 2, and 3) into the training set and mean-normalize the ratings, then
>> A will be more similar to B than to C, which is not true.
>>
>> ________________________________
>> From: Sean Owen <[email protected]>
>> To: Mahout User List <[email protected]>; Ahmet Ylmaz
>> <[email protected]>
>> Sent: Saturday, February 16, 2013 8:41 PM
>> Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
>>
>> No, this is not a problem.
>>
>> Yes, it builds a model for each user, which takes a long time. It's
>> accurate, but time-consuming. It's meant for small data. You could write
>> your own test that holds out data for all test users at once. That's what
>> I did when I rewrote a lot of this, just because it was more useful to
>> have larger tests.
>>
>> There are several ways to choose the test data. One common way is by
>> time, but there is no time information here by default. The problem is
>> that, for example, recent ratings may be low -- or at least not high. But
>> the evaluation is of course asking the recommender for items that are
>> predicted to be highly rated. Random selection has the same problem.
>> Choosing by rating at least makes the test coherent.
>>
>> It does bias the training set, but the test set is supposed to be small.
>>
>> There is no way to actually know, a priori, what the top recommendations
>> are. You have no information to evaluate most recommendations. This makes
>> a precision/recall test fairly uninformative in practice. Still, it's
>> better than nothing and commonly understood.
>>
>> While precision/recall won't be high on tests like this, I don't get
>> values this low for the MovieLens data on any normal algorithm -- but you
>> might, if you choose an algorithm or parameters that don't work well.
>>
>> On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
>> > code. I think that there are two important problems here.
>> >
>> > According to my understanding, the experimental protocol used in this
>> > code is something like this:
>> >
>> > It takes away a certain percentage of users as test users. For each
>> > test user it builds a training set consisting of the ratings given by
>> > all other users, plus the ratings of the test user which are below the
>> > relevanceThreshold. It then builds a model, makes a recommendation to
>> > the test user, and finds the intersection between this recommendation
>> > list and the items rated above the relevanceThreshold by the test user.
>> > It then calculates the precision and recall in the usual way.
>> >
>> > Problems:
>> >
>> > 1. (mild) It builds a model for every test user, which can take a lot
>> > of time.
>> >
>> > 2. (severe) Only the ratings of the test user which are below the
>> > relevanceThreshold are put into the training set. This means that the
>> > algorithm only knows the preferences of the test user about the items
>> > which s/he doesn't like. This is not a good representation of the
>> > user's ratings.
>> >
>> > Moreover, when I ran this evaluator on the MovieLens 1M data, the
>> > precision and recall turned out to be, respectively,
>> >
>> > 0.011534185658699288
>> > 0.007905982905982885
>> >
>> > and the run took about 13 minutes on my Intel Core i3. (I used
>> > user-based recommendation with k=2.)
>> >
>> > Although I know that it is not OK to judge the performance of a
>> > recommendation algorithm by looking at these absolute precision and
>> > recall values, these numbers still seem too low to me, which might be
>> > the result of the second problem I mentioned above.
>> >
>> > Am I missing something?
>> >
>> > Thanks
>> > Ahmet
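For reference, the last step of the protocol described above boils down to something like this per test user (a simplified sketch, not the actual RecommenderIRStatsEvaluator code):

import java.util.List;
import java.util.Set;

public class IRStatsSketch {

    // Precision@N and recall for a single test user. `recommended` is the
    // top-N list from a model trained on all other users' ratings plus this
    // user's below-threshold ratings; `relevant` is the set of items this
    // user rated above the relevanceThreshold.
    static double[] precisionRecall(List<Long> recommended, Set<Long> relevant) {
        long hits = recommended.stream().filter(relevant::contains).count();
        double precision = recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
        double recall = relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
        return new double[] {precision, recall};
    }
}

In the usual formulation, these per-user values are then averaged over all test users.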
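The A/B/C example quoted above also checks out numerically: hold out A's high ratings (4 and 5), mean-center each user's remaining ratings, and compare over co-rated items, and A looks perfectly similar to B but less similar to C, even though A's full profile is identical to C's. A self-contained sketch (a toy mean-centered cosine, not Mahout's PearsonCorrelationSimilarity):

public class MeanCenteringDemo {

    // Each user's mean is taken over all of his/her rated items (0 = unrated),
    // and the similarity is a cosine over mean-centered ratings of co-rated
    // items -- a toy stand-in for Pearson-style similarity.
    static double similarity(double[] u, double[] v) {
        double meanU = mean(u), meanV = mean(v);
        double dot = 0, normU = 0, normV = 0;
        for (int i = 0; i < u.length; i++) {
            if (u[i] == 0 || v[i] == 0) continue;  // skip items not rated by both
            double a = u[i] - meanU, b = v[i] - meanV;
            dot += a * b;
            normU += a * a;
            normV += b * b;
        }
        return dot / Math.sqrt(normU * normV);
    }

    static double mean(double[] u) {
        double sum = 0;
        int n = 0;
        for (double x : u) {
            if (x != 0) { sum += x; n++; }
        }
        return sum / n;
    }

    public static void main(String[] args) {
        double[] aTrain = {1, 2, 3, 0, 0};  // A with the high ratings (4, 5) held out
        double[] b      = {1, 3, 5, 2, 4};
        double[] c      = {1, 2, 3, 4, 5};  // identical to A's full profile
        System.out.println("sim(A,B) = " + similarity(aTrain, b));  // 1.0
        System.out.println("sim(A,C) = " + similarity(aTrain, c));  // ~0.63
    }
}

So the below-threshold split really can distort the neighborhood, which is the substance of point 2 above.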
