I'll try to explain the problem again, but note that at this stage I'm
trying to establish a user-based collaborative recommendation baseline,
rather than getting the best possible performance for a given dataset, so
while the suggestions are appreciated, I'd like to focus on
GenericUserBasedRecommender with NearestNUserNeighborhood.

As I see it, there are two separate tasks here. The first one is the
recommendation task, where it makes sense to take the N most similar users
and generate recommendations based on their preferences. The second one is
the rating prediction task (or preference estimation in Mahout terms), where
given a target user and a target item the recommender generates a prediction
of the rating for that item. The way
GenericUserBasedRecommender.estimatePreference() is implemented now, it
can only generate predictions for very few items unless the neighbourhood
size is very large.
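
To make the setup explicit, this is roughly the baseline code I have in mind,
following the wiki example (a sketch only; the file name, the Pearson
similarity, and the IDs are placeholders for whatever is actually used):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedBaseline {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // N = 20 neighbours, chosen per user without reference to any particular item
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(20, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    long targetUserID = 123L;  // placeholder IDs
    long targetItemID = 456L;

    // Task 1: recommendation -- items drawn from the neighbours' preferences
    List<RecommendedItem> recs = recommender.recommend(targetUserID, 10);

    // Task 2: rating prediction -- comes back as Float.NaN when no neighbour rated the item
    float estimate = recommender.estimatePreference(targetUserID, targetItemID);

    System.out.println(recs + " / " + estimate);
  }
}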

To make this concrete, I'll give an example. Suppose we choose N=20, and we
want to predict the rating of the target user for the movie The Godfather.
This movie was rated by many users in the dataset, but unfortunately it
wasn't rated by the 20 users that are most similar to the target user. Thus,
estimatePreference() will return NaN as the prediction.
Now suppose that these top 20 users have a similarity of 1 to the target
user, and that the next 20 users have a similarity of 0.99 to the target
user, and they have all rated The Godfather. It would make sense to generate
a rating prediction based on the next 20 users, and this prediction is
actually likely to be pretty good.
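
In code, the situation looks something like this (continuing the sketch above,
inside the same main method; the item ID standing in for The Godfather is again
a placeholder):

    long godfatherItemID = 789L;  // placeholder for The Godfather's item ID

    // The neighbourhood is fixed per user, independently of the item we ask about.
    long[] neighbours = neighborhood.getUserNeighborhood(targetUserID);

    boolean anyNeighbourRatedIt = false;
    for (long neighbourID : neighbours) {
      // getPreferenceValue() returns null when that user hasn't rated the item
      if (model.getPreferenceValue(neighbourID, godfatherItemID) != null) {
        anyNeighbourRatedIt = true;
        break;
      }
    }

    float prediction = recommender.estimatePreference(targetUserID, godfatherItemID);
    // If anyNeighbourRatedIt is false, prediction is Float.NaN, even though
    // users just outside the top 20 (similarity 0.99) have rated the movie.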

As I mentioned in my original message, Herlocker et al. (1999) imply that
the number of neighbours shouldn't affect coverage. Another
widely cited paper, by Adomavicius and Tuzhilin (2005, see
http://ids.csom.umn.edu/faculty/gedas/papers/recommender-systems-survey-2005.pdf),
actually states that explicitly: "That is, the
value of the unknown rating r(c, s) for user c and item s is usually
computed as an aggregate of the ratings of some other (usually the N most
similar) users for the same item s: r(c, s) = aggr(c' in C^, r(c', s)) where
C^ denotes the set of N users that are the most similar to user c and who
have rated item s (N can range anywhere from 1 to the number of all users)."
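
For comparison, here is a rough sketch of the aggregation described in that
quote, i.e. taking the N most similar users among those who have actually rated
the target item. This is not how Mahout currently does it; the class and method
names are made up, and the similarity-weighted average is just one possible
choice of aggregate:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

final class RaterBasedEstimate {

  // Estimate r(c, s) from the n most similar users who have actually rated s.
  static float estimate(DataModel model, UserSimilarity similarity,
                        long userID, long itemID, int n) throws TasteException {
    PreferenceArray prefs = model.getPreferencesForItem(itemID); // users who rated s
    List<double[]> raters = new ArrayList<double[]>(); // {similarity, rating} pairs
    for (int i = 0; i < prefs.length(); i++) {
      long otherID = prefs.getUserID(i);
      if (otherID == userID) {
        continue;
      }
      double sim = similarity.userSimilarity(userID, otherID);
      if (!Double.isNaN(sim)) {
        raters.add(new double[] { sim, prefs.getValue(i) });
      }
    }
    if (raters.isEmpty()) {
      return Float.NaN; // nobody comparable has rated the item at all
    }
    // keep only the n most similar raters
    Collections.sort(raters, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(b[0], a[0]);
      }
    });
    double weightedSum = 0.0;
    double totalWeight = 0.0;
    for (double[] r : raters.subList(0, Math.min(n, raters.size()))) {
      weightedSum += r[0] * r[1];
      totalWeight += Math.abs(r[0]);
    }
    return totalWeight == 0.0 ? Float.NaN : (float) (weightedSum / totalWeight);
  }
}

With an aggregate like this, coverage is essentially independent of N, which is
the behaviour the literature seems to assume.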

While Sean's approach is probably also okay, especially for the
recommendation scenario, I think it's worth documenting that this
behaviour is somewhat inconsistent with the literature.

On Tue, Aug 10, 2010 at 00:32, Sean Owen <sro...@gmail.com> wrote:

> This is expected behavior as far as I understand the algorithm. I
> don't see how a user-based recommender can estimate a preference by X
> for Y if nobody who rated Y is connected to X at all.
>
> You can use a PreferenceInferrer to fill in a lot of missing data, but
> I don't really recommend this for more than experimentation.
>
> The issue here is mostly that the user-item matrix is too sparse. And
> yes, there are loads of follow-up suggestions that tackle that,
> depending on your data, as Alex hinted at.
>
> On Mon, Aug 9, 2010 at 3:31 AM, Yanir Seroussi <yanir.serou...@gmail.com> wrote:
> > Hi,
> >
> > The first example here (
> > https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation)
> > shows how to create a GenericUserBasedRecommender with a
> > NearestNUserNeighborhood. My problem/question is that setting n to any
> > small number seems to limit the coverage of the recommender, because the
> > nearest n users are calculated without taking the target item into
> > account.
> > For example, given a user X and n = 10, if we want to estimatePreference()
> > for an item Y, and this item is not rated by any user in the
> > neighbourhood, the prediction will be NaN. I don't think that this is what
> > one would expect from a user-based nearest-neighbour recommender, as
> > Herlocker et al. (1999), who are cited in the example page above, didn't
> > mention any change in coverage based on the number of nearest neighbours.
> > Am I doing something wrong, or is this the way it should be? I have a
> > feeling it is not the way it should be, because then using small
> > neighbourhood sizes makes no sense as it severely restricts the ability
> > of the recommender to estimate preferences.
> >
> > Please note that I observed this behaviour in version 0.3, but it seems
> > to be the same in the latest version.
> >
> > Cheers,
> > Yanir
> >
>
