There are two places in the code that make implementing content-based recommendation with a custom ItemSimilarity very difficult. I ran into these unknowingly some time ago.
AFAIK, the main purpose of a content-based strategy is to handle the "cold-start" problem, where no ratings exist for a new item and a CF-based approach cannot make any predictions. Unfortunately, implementing a custom ItemSimilarity alone will not achieve this, because before the ItemSimilarity implementation is used, a set of candidate items has to be found in the DataModel. Our default implementation selects all items that co-occur with one of the user's preferred items. If we have an item that has not been rated yet, we will run into a NoSuchItemException here. So a custom CandidateItemsStrategy will be necessary to make this work.

The situation is even worse when the most similar items need to be computed: GenericItemBasedRecommender.doMostSimilarItems(...) also selects only co-occurring items, but we did not implement an exchangeable strategy there, so this behavior currently cannot be customized.

I would suggest creating a construct similar to CandidateItemsStrategy for most-similar-items too. Any objections to that?

--sebastian

On 30.12.2010 21:54, Sean Owen wrote:
> You're on the right track. No I don't think the IDRescorer hurts. On
> the contrary it will save you from computing scores for movies that
> are not recommendable.
>
> It's hard to say what the 'right' content-based similarity metric is,
> as it will depend a lot on what data you have as input. You don't have
> much side information to go on here; it's possible that being from the
> same genre (or by the same director, etc.) is of little or no
> predictive value no matter what you apply to this data. Still, seems
> like you may need such a metric as a fall-back for the case of new
> movies where there is no rating-based metric available.
>
> You could hack up the code a little bit to do something like this: if
> too few similar items are found with the similarity metric, then
> compute similarities using the alternative content-based metric and
> proceed that way.
> It's a bit of a hack, and inelegant, but may work
> well for you in practice.
>
> Slope-one isn't based on item-item similarity so no I don't think the
> notion of content-based similarity applies. It comes up in item-based
> recommenders only.
>
>
> On Thu, Dec 30, 2010 at 11:42 AM, Vasil Vangelovski
> <[email protected]> wrote:
>> Hi
>>
>> I started diving into Mahout a few days ago. I've a basic understanding of
>> the machine learning concepts behind it, however I'm not all too familiar
>> with Mahout beyond the first 6 chapters of "Mahout in Action".
>>
>> I'm looking to implement the following kind of a recommendation engine (it's
>> not about movies but it's easiest to explain in this manner):
>>
>> Let's say I've the Movie Lens dataset. Complete with ratings, genres etc.
>> I'd want a recommender that would recommend only from a list of movies that
>> are showing in cinemas right now. That would be a list of 10-20 movies out
>> of 5000 for which there are ratings in the dataset.
>>
>> Given these are relatively new movies there will be a relatively low number
>> of ratings for them. So I guess I'd have to rely on content-based
>> recommendation of some kind.
>>
>> The first question is how would it affect performance if I use IDRescorer
>> for the purpose of just displaying an ordered list of recommendations in the
>> set of available movies (by implementing isFiltered, where the result would
>> be false most of the time 10/5000)?
>>
>> I know the simple way to implement content based CF would be to implement my
>> own ItemSimilarity based on ratings + movie genre information. However in
>> the case of the MovieLens dataset if I combine say Pearson correlation for
>> ratings + Tanimoto coefficient for genres (or whatever combination makes
>> sense here) it degrades performance (score) slightly for that dataset
>> compared to using Pearson alone. Should I ditch this method just because of
>> this reason?
>>
>> Further what would be other ways to implement content-based data in order to
>> improve a recommender for the described use case? Is there a straightforward
>> way to integrate content-based knowledge into a slope-one recommender?
>>
>> Thanks
>>
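
For reference, the fallback idea discussed in this thread (use the rating-based metric when enough co-ratings exist, otherwise fall back to a content-based one) can be sketched roughly as follows. This is plain Java with no Mahout dependency: the class name, the method names and the minOverlap threshold are all hypothetical and chosen only for illustration. In Mahout the same logic would presumably sit inside a custom ItemSimilarity, and a custom CandidateItemsStrategy would still be needed for unrated items to be considered at all.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not Mahout API: rating-based similarity with a genre fallback. */
final class ContentFallbackSketch {

  private ContentFallbackSketch() {}

  /** Tanimoto coefficient of two genre sets: |intersection| / |union|. */
  static double tanimoto(Set<String> a, Set<String> b) {
    Set<String> common = new HashSet<>(a);
    common.retainAll(b);
    int union = a.size() + b.size() - common.size();
    return union == 0 ? 0.0 : (double) common.size() / union;
  }

  /**
   * Pearson correlation over users who rated both items (userID -> rating).
   * Returns NaN when fewer than minOverlap users rated both, or when one
   * of the two rating vectors has no variance.
   */
  static double pearson(Map<Long, Double> x, Map<Long, Double> y, int minOverlap) {
    int n = 0;
    double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
    for (Map.Entry<Long, Double> e : x.entrySet()) {
      Double other = y.get(e.getKey());
      if (other == null) {
        continue; // this user did not rate the second item
      }
      double a = e.getValue();
      double b = other;
      n++;
      sumX += a;
      sumY += b;
      sumXY += a * b;
      sumX2 += a * a;
      sumY2 += b * b;
    }
    if (n < minOverlap) {
      return Double.NaN;
    }
    double num = n * sumXY - sumX * sumY;
    double den = Math.sqrt(n * sumX2 - sumX * sumX) * Math.sqrt(n * sumY2 - sumY * sumY);
    return den > 0.0 ? num / den : Double.NaN;
  }

  /** Rating-based similarity if it is defined, otherwise the genre-based fallback. */
  static double similarity(Map<Long, Double> ratings1, Map<Long, Double> ratings2,
                           Set<String> genres1, Set<String> genres2, int minOverlap) {
    double p = pearson(ratings1, ratings2, minOverlap);
    return Double.isNaN(p) ? tanimoto(genres1, genres2) : p;
  }

  public static void main(String[] args) {
    Map<Long, Double> oldMovie = new HashMap<>();
    oldMovie.put(1L, 4.0);
    oldMovie.put(2L, 2.0);
    Map<Long, Double> newMovie = new HashMap<>(); // cold start: no ratings yet

    Set<String> g1 = new HashSet<>();
    g1.add("comedy");
    g1.add("drama");
    Set<String> g2 = new HashSet<>();
    g2.add("drama");

    // No co-ratings exist, so the genre-based value is used.
    System.out.println(similarity(oldMovie, newMovie, g1, g2, 2));
  }
}
```

Note that the two metrics live on different scales (Pearson in [-1, 1], Tanimoto in [0, 1]), so mixing them in one ranking is exactly the kind of inelegance mentioned above; whether that matters would have to be checked against real data.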
