You're on the right track. No, I don't think the IDRescorer hurts. On the contrary, it will save you from computing scores for movies that are not recommendable.
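As a rough illustration of the filtering idea, here is a minimal sketch. So that it compiles on its own, it declares a local stand-in interface with the same two methods as Mahout's org.apache.mahout.cf.taste.recommender.IDRescorer; in a real project you would implement Mahout's interface instead. The class name NowShowingRescorer and the nowShowing set are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Local stand-in mirroring the two methods of Mahout's IDRescorer,
// declared here only so the sketch compiles without the Mahout jar.
interface IDRescorer {
    double rescore(long id, double originalScore);
    boolean isFiltered(long id);
}

// Hypothetical rescorer: lets through only the movies currently showing.
final class NowShowingRescorer implements IDRescorer {
    private final Set<Long> nowShowing;

    NowShowingRescorer(Set<Long> nowShowing) {
        // Defensive copy so later changes to the caller's set have no effect.
        this.nowShowing = new HashSet<>(nowShowing);
    }

    @Override
    public boolean isFiltered(long id) {
        // Returning true excludes the item from consideration.
        return !nowShowing.contains(id);
    }

    @Override
    public double rescore(long id, double originalScore) {
        // Leave the score as-is; this rescorer only filters.
        return originalScore;
    }
}
```

With Mahout itself, an instance like this would be passed as the third argument to recommender.recommend(userID, howMany, rescorer). Note that isFiltered returns true for items to exclude, so with 10-20 showing movies out of 5000 it returns true for most IDs.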
It's hard to say what the 'right' content-based similarity metric is, as it will depend a lot on what data you have as input. You don't have much side information to go on here; it's possible that being from the same genre (or by the same director, etc.) is of little or no predictive value no matter what you apply to this data.

Still, it seems like you may need such a metric as a fall-back for the case of new movies where there is no rating-based metric available. You could hack up the code a little bit to do something like this: if too few similar items are found with the rating-based similarity metric, then compute similarities using the alternative content-based metric and proceed that way. It's a bit of a hack, and inelegant, but it may work well for you in practice.

Slope-one isn't based on item-item similarity, so no, I don't think the notion of content-based similarity applies there. It comes up in item-based recommenders only.

On Thu, Dec 30, 2010 at 11:42 AM, Vasil Vangelovski <[email protected]> wrote:
> Hi
>
> I started diving into mahout a few days ago. I've a basic understanding of
> the machine learning concepts behind it, however I'm not all too familiar
> with mahout beyond the first 6 chapters of "Mahout in action".
>
> I'm looking to implement the following kind of a recommendation engine (it's
> not about movies but it's easiest to explain in this manner):
>
> Let's say I've the Movie Lens dataset. Complete with ratings, genres etc.
> I'd want a recommender that would recommend only from a list of movies that
> are showing in cinemas right now. That would be a list of 10-20 movies out
> of 5000 for which there are ratings in the dataset.
>
> Given these are relatively new movies there will be a relatively low number
> of ratings for them. So I guess I'd have to rely on content-based
> recommendation of some kind.
>
> The first question is how would it affect performance if I use IDRescorer
> for the purpose of just displaying an ordered list of recommendations in the
> set of available movies (by implementing isFiltered, where the result would
> be false most of the time 10/5000)?
>
> I know the simple way to implement content based CF would be to implement my
> own ItemSimilarity based on ratings + movie genre information. However in
> the case of the MovieLens dataset if I combine say pearson correlation for
> ratings + tanimoto coefficient for genres (or whatever combination makes
> sense here) it degrades performance (score) slightly for that dataset
> compared to using pearson alone. Should I ditch this method just because of
> this reason?
>
> Further what would be other ways to implement content-based data in order to
> improve a recommender for the described use case? Is there a straightforward
> way to integrate content-based knowledge into a slope-one recommender?
>
> Thanks
>
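The fall-back suggested in the reply above could be sketched roughly as follows. This is not Mahout code: PairSimilarity, GenreTanimoto, FallbackSimilarity and the minNeighbors threshold are all hypothetical names chosen for this example. The sketch assumes, as Mahout's similarity classes do, that an undefined similarity is reported as NaN.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal similarity contract; returns NaN when undefined.
interface PairSimilarity {
    double similarity(long itemA, long itemB);
}

// Content-based metric: Tanimoto (Jaccard) coefficient over genre sets.
final class GenreTanimoto implements PairSimilarity {
    private final Map<Long, Set<String>> genres;

    GenreTanimoto(Map<Long, Set<String>> genres) {
        this.genres = genres;
    }

    @Override
    public double similarity(long a, long b) {
        Set<String> ga = genres.get(a);
        Set<String> gb = genres.get(b);
        if (ga == null || gb == null || (ga.isEmpty() && gb.isEmpty())) {
            return Double.NaN; // no genre data to compare
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        int union = ga.size() + gb.size() - shared.size();
        return (double) shared.size() / union;
    }
}

final class FallbackSimilarity {
    // Use rating-based similarities unless fewer than minNeighbors candidates
    // have a defined value; in that case use the content-based metric instead.
    static Map<Long, Double> similarTo(long target, long[] candidates,
                                       PairSimilarity ratingBased,
                                       PairSimilarity contentBased,
                                       int minNeighbors) {
        Map<Long, Double> found = score(target, candidates, ratingBased);
        return found.size() >= minNeighbors
                ? found
                : score(target, candidates, contentBased);
    }

    // Collect every candidate whose similarity to the target is defined.
    private static Map<Long, Double> score(long target, long[] candidates,
                                           PairSimilarity sim) {
        Map<Long, Double> out = new LinkedHashMap<>();
        for (long c : candidates) {
            if (c == target) {
                continue; // skip the item itself
            }
            double s = sim.similarity(target, c);
            if (!Double.isNaN(s)) {
                out.put(c, s);
            }
        }
        return out;
    }
}
```

Switching metrics wholesale, rather than mixing the two scores, keeps the rating-based results untouched for well-rated items, which matters given that the blended metric scored slightly worse than Pearson alone on MovieLens.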
