Re: A question regarding GenericUserBasedRecommender
I agree with both of you (Ted and Sean) regarding the generation of real-world recommendations, and I also think that it makes sense to use a static neighbourhood when generating recommendations, because it would be too slow to estimate the target user's rating of every item in the system. I still think it's worthwhile to compare the rating errors of different algorithms on all items, because in general the more accurate algorithm is likely to generate better recommendations.

Comparing the actual generated recommendations is not as simple as comparing rating errors, though it is very important. As Ted said, you need to worry about the ordering of the top few items, and you typically don't want all the recommended items to be too similar to each other or too obvious. However, capturing all these qualities in a single metric is quite hard, if not impossible.

In any case, I think that the current behaviour should be documented, either in the javadocs or in the wiki or both. I'm happy to update the documentation if it's okay with you.

On Fri, Aug 13, 2010 at 06:09, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Focussing on rating error is also problematic in that it causes us to worry
> about being correct about the estimated ratings for items that will *never*
> be shown to a user. In my mind, the only thing that matters in a practical
> system is the ordering of the top few items and the rough composition of
> the top few hundred items. Everything else makes no difference at all for
> real-world recommendations.
>
> On Thu, Aug 12, 2010 at 12:53 PM, Sean Owen <sro...@gmail.com> wrote:
> > I agree with your reading of what the Herlocker paper is saying. The
> > paper is focused on producing one estimated rating, not recommendations.
> > While those tasks are related -- recommendations are those with the
> > highest estimated ratings -- translating what's in Herlocker directly to
> > a recommendation algorithm is a significant jump.
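A toy illustration of the point being discussed, that a predictor can win on rating error while still getting the top-of-list ordering wrong (a self-contained sketch; the data and both "predictors" are invented for illustration, and this is not Mahout code):

```java
public class EvalSketch {
  // Mean absolute error between actual and estimated ratings (lower is better).
  static double mae(double[] actual, double[] estimated) {
    double sum = 0.0;
    for (int i = 0; i < actual.length; i++) {
      sum += Math.abs(actual[i] - estimated[i]);
    }
    return sum / actual.length;
  }

  // Index of the item with the highest estimated rating, i.e. the top recommendation.
  static int topItem(double[] estimated) {
    int best = 0;
    for (int i = 1; i < estimated.length; i++) {
      if (estimated[i] > estimated[best]) best = i;
    }
    return best;
  }

  public static void main(String[] args) {
    double[] actual = {5.0, 4.0, 2.0, 1.0}; // item 0 is truly the best
    double[] predA  = {4.4, 4.5, 2.0, 1.0}; // tiny errors, but ranks item 1 first
    double[] predB  = {3.5, 2.8, 1.5, 0.5}; // larger errors, correct ordering
    System.out.printf("A: mae=%.3f top=%d%n", mae(actual, predA), topItem(predA));
    System.out.printf("B: mae=%.3f top=%d%n", mae(actual, predB), topItem(predB));
  }
}
```

Predictor A has the lower MAE but recommends the wrong item first, while predictor B has the higher MAE and the correct ordering, which is why a single rating-error metric cannot capture recommendation quality on its own.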
Re: A question regarding GenericUserBasedRecommender
Focussing on rating error is also problematic in that it causes us to worry about being correct about the estimated ratings for items that will *never* be shown to a user. In my mind, the only thing that matters in a practical system is the ordering of the top few items and the rough composition of the top few hundred items. Everything else makes no difference at all for real-world recommendations.

On Thu, Aug 12, 2010 at 12:53 PM, Sean Owen <sro...@gmail.com> wrote:
> I agree with your reading of what the Herlocker paper is saying. The paper
> is focused on producing one estimated rating, not recommendations. While
> those tasks are related -- recommendations are those with the highest
> estimated ratings -- translating what's in Herlocker directly to a
> recommendation algorithm is a significant jump.
Re: A question regarding GenericUserBasedRecommender
On Tue, Aug 10, 2010 at 15:54, Sean Owen <sro...@gmail.com> wrote:
> On Mon, Aug 9, 2010 at 6:46 PM, Yanir Seroussi <yanir.serou...@gmail.com> wrote:
> > As I see it, there are two separate tasks here. The first one is the
> > recommendation task, where it makes sense to take the N most similar
> > users and generate recommendations based on their preferences. The second
> > one is the rating prediction task (or preference estimation in Mahout
> > terms), where given a target user and a target item the recommender
> > generates a prediction of the rating for the item.
>
> They are related tasks. Estimating preferences is *a* way to achieve the
> further goal of recommending items. It is how just about all algorithms
> work, so I'd say that in practice estimating preference is a sub-task of
> recommendation, and really the core of it.

Agreed.

> > In the way GenericUserBasedRecommender.estimatePreference() is
> > implemented now, it can only generate predictions for very few items
> > unless the neighbourhood size is very large.
>
> I don't think this is necessarily true. Sparse data can mean users are
> hard to connect to other users, or that those connections yield few
> candidates for recommendations, sure. A large neighborhood doesn't help
> this though. If there isn't 1 nearby user, there aren't 100 either. But,
> crudely, I think your data has to be really really sparse before you
> frequently can't recommend anything at all. I don't believe it's a
> question of implementation. This is simply what user-based recommendation
> is.

As it is now, GenericUserBasedRecommender.estimatePreference() can only estimate preferences of items that were rated by users in the static neighbourhood. Thus, a larger neighbourhood is likely to contain more neighbours that rated unknown items. For example, if we take only one neighbour, we can only estimate preferences of items that this neighbour rated.

> > To make this concrete, I'll give an example. Suppose we choose N=20, and
> > we want to predict the rating of the target user for the movie The
> > Godfather. This movie was rated by many users in the dataset, but
> > unfortunately it wasn't rated by the 20 users that are most similar to
> > the target user. Thus, estimatePreference() will return NaN as the
> > prediction.
>
> That's right. There would be no basis for estimating a preference.
>
> > Now suppose that these top 20 users have a similarity of 1 to the target
> > user, and that the next 20 users have a similarity of 0.99 to the target
> > user, and they have all rated The Godfather. It would make sense to
> > generate a rating prediction based on the next 20 users, and this
> > prediction is actually likely to be pretty good.
>
> True, and if this were the case, I'd suggest that you want to define the
> neighborhood by a similarity threshold, not by size. Let everyone with
> similarity >= 0.9 or something be in the neighborhood.
>
> > As I mentioned in my original message, it is implied by Herlocker et al.
> > (1999) that the number of neighbours shouldn't affect coverage. Another
> > widely cited paper, by Adomavicius and Tuzhilin (2005, see
> > http://ids.csom.umn.edu/faculty/gedas/papers/recommender-systems-survey-2005.pdf),
> > actually states that explicitly: "That is, the value of the unknown
> > rating r(c, s) for user c and item s is usually computed as an aggregate
> > of the ratings of some other (usually the N most similar) users for the
> > same item s: r(c, s) = aggr(c' in C^, r(c', s)), where C^ denotes the set
> > of N users that are the most similar to user c and who have rated item s
> > (N can range anywhere from 1 to the number of all users)." While Sean's
> > approach is probably also okay, especially for the recommendation
> > scenario, I think it's worthwhile documenting the fact that this
> > behaviour is somewhat inconsistent with the literature.
>
> As (I think) you say, it's consistent with Herlocker -- this is just
> plain-vanilla user-based recommendation. The other approach you cite makes
> sense but gets a lot more expensive to compute. You need a new
> neighborhood for every candidate item. It is also, in my world view, not
> the plain-vanilla approach that has become called user-based
> recommendation. So, no I do not find this inconsistent.

No, I'm not saying that it's consistent with Herlocker et al. They say in section 6 that "the second technique is to pick the best n correlates for a given n. This technique performs reasonably well, as it does not limit prediction coverage." The fact that no matter what n you choose, your coverage remains the same, means that the n users are dependent on the target item. Also, in section 2 of the paper, they say that the second step in generating a prediction is to "select a subset of users to use as a set of predictors (possibly for a specific item)". The other neighbourhood selection technique they experimented with was setting a similarity threshold, which did affect coverage, as it is not item-specific. Putting all of this together strongly suggests that the neighbourhoods used by Herlocker et al. were selected per target item.
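The two neighbourhood definitions being contrasted in this exchange can be sketched side by side (a self-contained toy, not Mahout's implementation; the user names, similarities, ratings, and the similarity-weighted-average aggregation are all invented for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class NeighbourhoodSketch {
  // Similarity of each candidate user to the (implicit) target user.
  static final Map<String, Double> SIM =
      Map.of("u1", 1.0, "u2", 0.9, "u3", 0.8, "u4", 0.7);
  // Each user's ratings: user -> (item -> rating).
  static final Map<String, Map<String, Double>> RATINGS = Map.of(
      "u1", Map.of("A", 5.0),
      "u2", Map.of("A", 4.0),
      "u3", Map.of("B", 3.0),
      "u4", Map.of("B", 2.0));

  // Similarity-weighted average of the neighbours' ratings for 'item';
  // NaN when no neighbour has rated the item (mirroring the NaN behaviour discussed).
  static double estimate(List<String> neighbours, String item) {
    double num = 0.0, den = 0.0;
    for (String u : neighbours) {
      Double r = RATINGS.get(u).get(item);
      if (r != null) { num += SIM.get(u) * r; den += SIM.get(u); }
    }
    return den == 0.0 ? Double.NaN : num / den;
  }

  // "Static" neighbourhood: top-n users by similarity, ignoring the target item.
  static List<String> staticTopN(int n) {
    return SIM.keySet().stream()
        .sorted((a, b) -> Double.compare(SIM.get(b), SIM.get(a)))
        .limit(n).collect(Collectors.toList());
  }

  // Item-specific neighbourhood: top-n users among those who rated the item.
  static List<String> itemSpecificTopN(int n, String item) {
    return SIM.keySet().stream()
        .filter(u -> RATINGS.get(u).containsKey(item))
        .sorted((a, b) -> Double.compare(SIM.get(b), SIM.get(a)))
        .limit(n).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Static top-2 is {u1, u2}, neither of whom rated B, so the estimate is NaN.
    System.out.println(estimate(staticTopN(2), "B"));
    // Item-specific top-2 for B is {u3, u4}, so an estimate exists.
    System.out.println(estimate(itemSpecificTopN(2, "B"), "B"));
  }
}
```

The second variant always produces an estimate when anyone rated the item, which is why its coverage is independent of n, but it also has to build a fresh neighbourhood for every candidate item, which is the cost Sean points out.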
Re: A question regarding GenericUserBasedRecommender
I'll try to explain the problem again, but note that at this stage I'm trying to establish a user-based collaborative recommendation baseline, rather than getting the best performance I can for a given dataset, so while the suggestions are appreciated, I'd like to focus on GenericUserBasedRecommender with NearestNUserNeighborhood.

As I see it, there are two separate tasks here. The first one is the recommendation task, where it makes sense to take the N most similar users and generate recommendations based on their preferences. The second one is the rating prediction task (or preference estimation in Mahout terms), where given a target user and a target item the recommender generates a prediction of the rating for the item. In the way GenericUserBasedRecommender.estimatePreference() is implemented now, it can only generate predictions for very few items unless the neighbourhood size is very large.

To make this concrete, I'll give an example. Suppose we choose N=20, and we want to predict the rating of the target user for the movie The Godfather. This movie was rated by many users in the dataset, but unfortunately it wasn't rated by the 20 users that are most similar to the target user. Thus, estimatePreference() will return NaN as the prediction. Now suppose that these top 20 users have a similarity of 1 to the target user, and that the next 20 users have a similarity of 0.99 to the target user, and they have all rated The Godfather. It would make sense to generate a rating prediction based on the next 20 users, and this prediction is actually likely to be pretty good.

As I mentioned in my original message, it is implied by Herlocker et al. (1999) that the number of neighbours shouldn't affect coverage. Another widely cited paper, by Adomavicius and Tuzhilin (2005, see http://ids.csom.umn.edu/faculty/gedas/papers/recommender-systems-survey-2005.pdf), actually states that explicitly: "That is, the value of the unknown rating r(c, s) for user c and item s is usually computed as an aggregate of the ratings of some other (usually the N most similar) users for the same item s: r(c, s) = aggr(c' in C^, r(c', s)), where C^ denotes the set of N users that are the most similar to user c and who have rated item s (N can range anywhere from 1 to the number of all users)." While Sean's approach is probably also okay, especially for the recommendation scenario, I think it's worthwhile documenting the fact that this behaviour is somewhat inconsistent with the literature.

On Tue, Aug 10, 2010 at 00:32, Sean Owen <sro...@gmail.com> wrote:
> This is expected behavior as far as I understand the algorithm. I don't
> see how a user-based recommender can estimate a preference by X for Y if
> nobody who rated Y is connected to X at all. You can use a
> PreferenceInferrer to fill in a lot of missing data, but I don't really
> recommend this for more than experimentation. The issue here is mostly
> that the user-item matrix is too sparse. And yes, there are loads of
> follow-up suggestions that tackle that, depending on your data, as Alex
> hinted at.
>
> On Mon, Aug 9, 2010 at 3:31 AM, Yanir Seroussi <yanir.serou...@gmail.com> wrote:
> > Hi,
> >
> > The first example here
> > (https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation)
> > shows how to create a GenericUserBasedRecommender with a
> > NearestNUserNeighborhood. My problem/question is that setting n to any
> > small number seems to limit the coverage of the recommender, because the
> > nearest n users are calculated without taking the target item into
> > account. For example, given a user X and n = 10, if we want to
> > estimatePreference() for an item Y, if this item is not rated by any
> > user in the neighbourhood, the prediction will be NaN. I don't think
> > that this is what one would expect from a user-based nearest-neighbour
> > recommender, as Herlocker et al. (1999), who are cited in the example
> > page above, didn't mention any change in coverage based on the number of
> > nearest neighbours. Am I doing something wrong, or is this the way it
> > should be? I have a feeling it is not the way it should be, because then
> > using small neighbourhood sizes makes no sense, as it severely restricts
> > the ability of the recommender to estimate preferences. Please note that
> > I observed this behaviour in version 0.3, but it seems to be the same in
> > the latest version.
> >
> > Cheers,
> > Yanir
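The N=20 scenario described in the thread, together with Sean's threshold suggestion, can be simulated in a few lines (again a self-contained sketch with invented data, not Mahout code: 20 users at similarity 1.0 who never rated the item, and 20 at 0.99 who all rated it 4.5):

```java
import java.util.*;
import java.util.stream.*;

public class ThresholdSketch {
  static final int USERS = 40;
  static final double[] SIM = new double[USERS];       // similarity to the target user
  static final Double[] GODFATHER = new Double[USERS]; // rating of the item, or null
  static {
    for (int u = 0; u < USERS; u++) {
      if (u < 20) { SIM[u] = 1.0;  GODFATHER[u] = null; } // most similar, never rated it
      else        { SIM[u] = 0.99; GODFATHER[u] = 4.5; }  // slightly less similar, rated it
    }
  }

  // Neighbourhood = the n users with highest similarity, chosen without looking at the item.
  static List<Integer> nearestN(int n) {
    return IntStream.range(0, USERS).boxed()
        .sorted((a, b) -> Double.compare(SIM[b], SIM[a]))
        .limit(n).collect(Collectors.toList());
  }

  // Neighbourhood = every user at or above a similarity threshold.
  static List<Integer> threshold(double t) {
    return IntStream.range(0, USERS).boxed()
        .filter(u -> SIM[u] >= t).collect(Collectors.toList());
  }

  // Similarity-weighted average over neighbours who rated the item; NaN if none did.
  static double estimate(List<Integer> neighbourhood) {
    double num = 0.0, den = 0.0;
    for (int u : neighbourhood) {
      if (GODFATHER[u] != null) { num += SIM[u] * GODFATHER[u]; den += SIM[u]; }
    }
    return den == 0.0 ? Double.NaN : num / den;
  }

  public static void main(String[] args) {
    // The 20 nearest users never rated the item, so the size-based neighbourhood gives NaN.
    System.out.println(estimate(nearestN(20)));
    // A similarity threshold of 0.99 admits all 40 users, so an estimate exists.
    System.out.println(estimate(threshold(0.99)));
  }
}
```

This mirrors the behaviour reported for NearestNUserNeighborhood: the size-based neighbourhood yields NaN even though 20 nearly-as-similar users rated the item, while a threshold-based neighbourhood (what Mahout exposes as a separate neighbourhood implementation) covers it.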