On Mon, Aug 9, 2010 at 6:46 PM, Yanir Seroussi <[email protected]> wrote:

> As I see it, there are two separate tasks here. The first one is the
> recommendation task, where it makes sense to take the N most similar users
> and generate recommendations based on their preferences. The second one is
> the rating prediction task (or preference estimation in Mahout terms), where
> given a target user and a target item the recommender generates a prediction
> of the rating for the item.

They are related tasks. Estimating preferences is *a* way to achieve the
further goal of recommending items. It is how just about all algorithms work,
so I'd say that in practice estimating preferences is a sub-task of
recommendation, and really the core of it.

> In the way GenericUserBasedRecommender.estimatePreference() is implemented
> now, it can only generate predictions for very few items unless the
> neighbourhood size is very large.

I don't think this is necessarily true. Sparse data can mean users are hard
to connect to other users, or that those connections yield few candidates for
recommendations, sure. A large neighborhood doesn't help this, though: if
there isn't one "nearby" user, there aren't 100 either. But, crudely, I think
your data has to be really, really sparse before you frequently can't
recommend anything at all. I don't believe it's a question of implementation;
this is simply what user-based recommendation is.

> To make this concrete, I'll give an example. Suppose we choose N=20, and we
> want to predict the rating of the target user for the movie The Godfather.
> This movie was rated by many users in the dataset, but unfortunately it
> wasn't rated by the 20 users that are most similar to the target user.
> Thus, estimatePreference() will return NaN as the prediction.

That's right. There would be no basis for estimating a preference.

> Now suppose that these top 20 users have a similarity of 1 to the target
> user, and that the next 20 users have a similarity of 0.99 to the target
> user, and they have all rated The Godfather. It would make sense to
> generate a rating prediction based on the next 20 users, and this
> prediction is actually likely to be pretty good.

True, and if this were the case, I'd suggest that you want to define the
neighborhood by a similarity threshold, not by size. Let everyone with
similarity >= 0.9, or something, be in the neighborhood.
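To make the size-versus-threshold point concrete, here's a minimal, self-contained sketch -- toy users and made-up similarity values, not Mahout's actual implementation. A neighborhood capped at N=2 picks the two closest users, neither of whom rated the item, so the estimate is NaN; a similarity threshold of 0.9 admits the slightly-less-similar users who did rate it:

```java
import java.util.*;
import java.util.stream.*;

public class NeighborhoodSketch {
    // Toy data: each neighbor's similarity to the target user, and that
    // neighbor's rating of the target item (null = didn't rate it).
    record Neighbor(String user, double similarity, Double ratingOfItem) {}

    static double estimate(List<Neighbor> neighborhood) {
        // Similarity-weighted average over neighbors who rated the item;
        // NaN when none of them did, mirroring estimatePreference().
        double num = 0, den = 0;
        for (Neighbor n : neighborhood) {
            if (n.ratingOfItem() != null) {
                num += n.similarity() * n.ratingOfItem();
                den += n.similarity();
            }
        }
        return den == 0 ? Double.NaN : num / den;
    }

    public static void main(String[] args) {
        List<Neighbor> users = List.of(
            new Neighbor("a", 1.00, null),  // most similar, didn't rate the item
            new Neighbor("b", 1.00, null),
            new Neighbor("c", 0.99, 4.0),   // nearly as similar, did rate it
            new Neighbor("d", 0.99, 5.0));

        // Size-based neighborhood, N=2: the two most similar users, neither
        // of whom rated the item, so the estimate is NaN.
        List<Neighbor> bySize = users.stream()
            .sorted(Comparator.comparingDouble(Neighbor::similarity).reversed())
            .limit(2).collect(Collectors.toList());
        System.out.println(estimate(bySize));       // NaN

        // Threshold-based neighborhood (similarity >= 0.9): includes c and d,
        // so an estimate exists.
        List<Neighbor> byThreshold = users.stream()
            .filter(n -> n.similarity() >= 0.9).collect(Collectors.toList());
        System.out.println(estimate(byThreshold));  // 4.5
    }
}
```

In Mahout itself the analogous change is constructing the GenericUserBasedRecommender with a ThresholdUserNeighborhood instead of a NearestNUserNeighborhood.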
> As I mentioned in my original message, it is implied from Herlocker et al.
> (1999) that the number of neighbours shouldn't affect coverage. Another
> widely cited paper, by Adomavicius and Tuzhilin (2005, see
> http://ids.csom.umn.edu/faculty/gedas/papers/recommender-systems-survey-2005.pdf),
> actually states that explicitly: "That is, the value of the unknown rating
> r(c, s) for user c and item s is usually computed as an aggregate of the
> ratings of some other (usually the N most similar) users for the same item
> s: r(c, s) = aggr(c' in C^, r(c', s)), where C^ denotes the set of N users
> that are the most similar to user c and who have rated item s (N can range
> anywhere from 1 to the number of all users)."
>
> While Sean's approach is probably also okay, especially for the
> recommendation scenario, I think it's worthwhile documenting the fact that
> this behaviour is somewhat inconsistent with the literature.

As (I think) you say, it's consistent with Herlocker -- this is just
plain-vanilla user-based recommendation. The other approach you cite makes
sense, but it gets a lot more expensive to compute: you need a new
neighborhood for every candidate item. It is also, in my world view, not the
plain-vanilla approach that has come to be called "user-based
recommendation". So, no, I do not find this inconsistent.
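For what it's worth, the aggregate in that quote is easy to sketch; this is a toy illustration with made-up names and data, not anyone's production code. The per-item set C^ (the N most similar users *who rated s*) is what makes this formulation expensive in practice -- it has to be recomputed for every candidate item s:

```java
import java.util.*;
import java.util.stream.*;

public class PerItemNeighborhood {
    // r(c, s) = aggr(c' in C^, r(c', s)), where C^ is the set of the N users
    // most similar to target user c who have also rated item s. Here the
    // aggregate is a similarity-weighted average.
    static double predict(Map<String, Double> similarityToTarget,
                          Map<String, Double> ratingsOfItem, int n) {
        // Keep only users who rated the item, then take the n most similar.
        List<String> cHat = similarityToTarget.keySet().stream()
            .filter(ratingsOfItem::containsKey)
            .sorted(Comparator.comparingDouble(
                (String u) -> similarityToTarget.get(u)).reversed())
            .limit(n)
            .collect(Collectors.toList());
        double num = 0, den = 0;
        for (String u : cHat) {
            num += similarityToTarget.get(u) * ratingsOfItem.get(u);
            den += similarityToTarget.get(u);
        }
        return den == 0 ? Double.NaN : num / den;
    }

    public static void main(String[] args) {
        Map<String, Double> sims = Map.of("a", 1.0, "b", 0.99, "c", 0.5);
        // Only b and c rated this item, so C^ = {b, c} even though a is the
        // single most similar user overall.
        Map<String, Double> ratings = Map.of("b", 4.0, "c", 2.0);
        System.out.println(predict(sims, ratings, 2));
    }
}
```

Note that with a fixed global neighborhood the sort over users happens once per target user, whereas here the filter-then-sort runs again for every item being scored.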
