Most watched by that particular user. The issue is that the recommender is trying to answer: "of all items the user has not interacted with, which is the user most likely to interact with?" So the 'right answers' to the quiz it is given ought to be answers to this question. That is why the test data ought to be the items that appear to be the most interacted with / preferred.
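A minimal sketch of that hold-out scheme, in Python rather than Mahout's Java API (the function names and the toy `user -> {item: rating}` data format are my own, purely for illustration): hold out each user's most-preferred items as the test set, then score the recommender's top-k list by precision against that held-out set.

```python
def split_by_preference(ratings, holdout=2):
    """Hold out each user's most-preferred items as test data.

    ratings: dict of user -> dict of item -> rating (hypothetical toy format).
    Returns (train, test) dicts in the same format.
    """
    train, test = {}, {}
    for user, prefs in ratings.items():
        # Rank the user's items by preference, highest first.
        ranked = sorted(prefs, key=prefs.get, reverse=True)
        held = set(ranked[:holdout])
        test[user] = {i: prefs[i] for i in held}
        train[user] = {i: r for i, r in prefs.items() if i not in held}
    return train, test

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that appear in the held-out set."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k
```

For example, holding out 2 items from a user who rated `a` and `b` highest puts exactly those two in the test set, and a top-3 list containing one of them scores a precision of 1/3 at k=3.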
For example, if you watched 10 Star Trek episodes and then 1 episode of The Simpsons, and the Simpsons episode was held out, the recommender is almost surely not going to predict it -- certainly not above more Star Trek. That seems like correct behavior, but it would be scored badly by a simple precision test.

There are two downsides to this approach. First, removing well-liked items from the training set may meaningfully skew a user's recommendations. That's not such a big issue if the test set is small -- and it should be. Second, by taking out data this way you end up with a training set that never really existed at any one point in time. That could also be a source of bias. Using the most recent data points tends to avoid both of these problems -- but then runs into the problem above.

There's another approach I've been playing with, which works when the recommender produces a score for each recommendation, not just a ranked list. You can train on data up to a certain point in time, then have the recommender score the observations that actually happened after that point. Ideally it should produce a high score for the things that really were observed next. This isn't implemented in Mahout, but you do get a score with recommendations even without ratings.
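The time-based approach can be sketched like this (again an illustrative Python sketch, not Mahout code; the trivial popularity scorer stands in for a real recommender, and all names here are hypothetical): split the event log at a point in time, so the training set is one that genuinely existed at that moment, then average the scores the recommender assigns to the items actually observed afterwards.

```python
def time_split(events, cutoff):
    """Split (user, item, timestamp) events at a point in time.

    Everything at or before `cutoff` is training data -- a set that
    actually existed at that moment -- and everything after is the
    'future' we ask the recommender to score.
    """
    train = [e for e in events if e[2] <= cutoff]
    future = [e for e in events if e[2] > cutoff]
    return train, future

def popularity_scores(train):
    """A stand-in recommender: score each item by its share of training events."""
    counts = {}
    for _user, item, _ts in train:
        counts[item] = counts.get(item, 0) + 1
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def mean_future_score(scores, future):
    """Average score assigned to items that really were observed next.

    Higher is better; items the recommender never saw score 0.
    """
    if not future:
        return 0.0
    return sum(scores.get(item, 0.0) for _user, item, _ts in future) / len(future)
```

With 3 training events (two for `a`, one for `b`) and a future containing one `a` event and one unseen item, the metric comes out to the mean of 2/3 and 0, i.e. 1/3 -- a recommender that scores actually-observed items highly does better on this metric without any items having been removed from its training data.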
