There are two places in the code that make implementing content-based
recommendation with a custom ItemSimilarity very difficult. I ran into
these unknowingly some time ago.

AFAIK, the main purpose of using a content-based strategy would be to
handle the "cold-start" problem, where no ratings exist for a new item
and a CF-based approach cannot make any predictions.

Unfortunately, this will not work by only implementing a custom
ItemSimilarity, because before the ItemSimilarity implementation is
used, a set of candidate items has to be found in the DataModel. In our
default implementation, all items that co-occur with one of the user's
preferred items are selected. If we have an item that has not been rated
yet, we will run into a NoSuchItemException here.

So a custom CandidateItemsStrategy will be necessary to make this work.
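
A minimal sketch of the difference, using plain Java collections rather
than Mahout's actual interfaces. The class name, method names and data
layout below are made up for illustration; a real implementation would
go behind CandidateItemsStrategy and read from the DataModel. It only
shows why co-occurrence alone can never surface an unrated item, and one
possible way to add such items to the candidate set.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Mahout.
public class ColdStartCandidates {

    // Default-style selection: every item rated by a user who also rated
    // at least one of the preferred items, minus the preferred items.
    static Set<Long> coOccurring(Set<Long> preferred,
                                 Map<Long, Set<Long>> itemsByUser) {
        Set<Long> result = new HashSet<>();
        for (Set<Long> rated : itemsByUser.values()) {
            boolean overlaps = false;
            for (Long item : preferred) {
                if (rated.contains(item)) {
                    overlaps = true;
                    break;
                }
            }
            if (overlaps) {
                result.addAll(rated);
            }
        }
        result.removeAll(preferred);
        return result;
    }

    // Cold-start-aware selection: the co-occurring items plus every
    // catalog item that nobody has rated yet.
    static Set<Long> withUnrated(Set<Long> preferred,
                                 Map<Long, Set<Long>> itemsByUser,
                                 Set<Long> catalog) {
        Set<Long> result = coOccurring(preferred, itemsByUser);
        Set<Long> ratedByAnyone = new HashSet<>();
        for (Set<Long> rated : itemsByUser.values()) {
            ratedByAnyone.addAll(rated);
        }
        for (Long item : catalog) {
            if (!ratedByAnyone.contains(item)) {
                result.add(item);
            }
        }
        result.removeAll(preferred);
        return result;
    }
}
```

With the second variant the unrated items reach the ItemSimilarity, so
a content-based similarity gets a chance to score them.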

The situation is even worse when the most-similar-items need to be
computed. GenericItemBasedRecommender.doMostSimilarItems(...) also
selects only co-occurring items, but we did not implement an
exchangeable strategy there, so this behavior cannot currently be
customized.

I would suggest creating a construct similar to CandidateItemsStrategy
for most-similar-items too. Any objections to that?
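
A rough sketch of what such a construct might look like, again in plain
Java and not against Mahout's real classes. The interface names
CandidateSimilarItemsStrategy and Similarity, and the genre-based
Tanimoto similarity, are assumptions chosen for illustration only. The
point is that the candidate selection is passed in, so it can be
exchanged without touching the ranking code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Mahout.
public class MostSimilarSketch {

    // Exchangeable hook: decides which items are compared to itemId.
    interface CandidateSimilarItemsStrategy {
        Set<Long> candidatesFor(long itemId);
    }

    // Stand-in for an item-item similarity.
    interface Similarity {
        double between(long a, long b);
    }

    // Tanimoto coefficient of two attribute sets:
    // size of the intersection divided by size of the union.
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    // Ranks the candidates chosen by the strategy by descending
    // similarity to itemId (ties broken by item ID) and keeps the top
    // howMany.
    static List<Long> mostSimilar(long itemId, int howMany,
                                  CandidateSimilarItemsStrategy strategy,
                                  Similarity similarity) {
        List<Long> candidates = new ArrayList<>(strategy.candidatesFor(itemId));
        candidates.remove(Long.valueOf(itemId));
        Comparator<Long> bySimilarityDesc = Comparator
            .comparingDouble((Long c) -> similarity.between(itemId, c))
            .reversed();
        candidates.sort(bySimilarityDesc.thenComparing(Comparator.naturalOrder()));
        int n = Math.max(0, Math.min(howMany, candidates.size()));
        return new ArrayList<>(candidates.subList(0, n));
    }
}
```

A strategy that simply returns all known items, combined with a
content-based similarity, would then also find similar items for
something that has no ratings yet.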

--sebastian



On 30.12.2010 21:54, Sean Owen wrote:
> You're on the right track. No I don't think the IDRescorer hurts. On
> the contrary it will save you from computing scores for movies that
> are not recommendable.
> 
> It's hard to say what the 'right' content-based similarity metric is,
> as it will depend a lot on what data you have as input. You don't have
> much side information to go on here; it's possible that being from the
> same genre (or by the same director, etc.) is of little or no
> predictive value no matter what you apply to this data. Still, seems
> like you may need such a metric as a fall-back for the case of new
> movies where there is no rating-based metric available.
> 
> You could hack up the code a little bit to do something like this: if
> too few similar items are found with the similarity metric, then
> compute similarities using the alternative content-based metric and
> proceed that way. It's a bit of a hack, and inelegant, but may work
> well for you in practice.
> 
> Slope-one isn't based on item-item similarity so no I don't think the
> notion of content-based similarity applies. It comes up in item-based
> recommenders only.
> 
> 
> On Thu, Dec 30, 2010 at 11:42 AM, Vasil Vangelovski
> <[email protected]> wrote:
>> Hi
>>
>> I started diving into Mahout a few days ago. I have a basic understanding of
>> the machine learning concepts behind it; however, I'm not all too familiar
>> with Mahout beyond the first 6 chapters of "Mahout in Action".
>>
>> I'm looking to implement the following kind of a recommendation engine (it's
>> not about movies but it's easiest to explain in this manner):
>>
>> Let's say I have the MovieLens dataset, complete with ratings, genres etc.
>> I'd want a recommender that would recommend only from a list of movies that
>> are showing in cinemas right now. That would be a list of 10-20 movies out
>> of 5000 for which there are ratings in the dataset.
>>
>> Given these are relatively new movies there will be a relatively low number
>> of ratings for them. So I guess I'd have to rely on content-based
>> recommendation of some kind.
>>
>> The first question is how would it affect performance if I use IDRescorer
>> for the purpose of just displaying an ordered list of recommendations in the
>> set of available movies (by implementing isFiltered, where the result would
>> be false most of the time 10/5000)?
>>
>> I know the simple way to implement content-based CF would be to implement my
>> own ItemSimilarity based on ratings + movie genre information. However, in
>> the case of the MovieLens dataset, if I combine say Pearson correlation for
>> ratings + Tanimoto coefficient for genres (or whatever combination makes
>> sense here), it degrades performance (score) slightly for that dataset
>> compared to using Pearson alone. Should I ditch this method just because of
>> this reason?
>>
>> Further what would be other ways to implement content-based data in order to
>> improve a recommender for the described use case? Is there a straightforward
>> way to integrate content-based knowledge into a slope-one recommender?
>>
>> Thanks
>>
