Hi Jack,
I'll try to answer your questions in as much detail as possible:
Regarding point 2) --maxSimilaritiesPerItem
RecommenderJob uses item-based collaborative filtering to compute the
recommendations and is a parallelized implementation of the algorithm
presented in [1]. The main idea is to use a "neighbourhood" of similar
items that have already been rated by a user to estimate his/her
preference towards an unknown item. These similar items are found by
comparing the ratings of frequently co-rated items according to some
similarity measure. The parameter --maxSimilaritiesPerItem lets you
specify the number of similar items per item to consider when estimating
the preference towards an unknown item. Usually a small number of similar
items is sufficient; have a look at [1] for some numbers and experiments.
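To make that a bit more concrete, here is a tiny self-contained sketch (not
Mahout's actual code, just the textbook weighted-sum formula from [1]) of how
the preference towards an unknown item can be estimated from the (at most
--maxSimilaritiesPerItem) similar items the user has already rated; the class
and map names are made up for illustration:

  import java.util.Map;

  public class ItemBasedEstimate {

    /** similarItems maps itemID -> similarity to the unknown item,
     *  userRatings maps itemID -> the user's rating (hypothetical inputs). */
    static double estimatePreference(Map<Long, Double> similarItems,
                                     Map<Long, Double> userRatings) {
      double weightedSum = 0.0;
      double similaritySum = 0.0;
      for (Map.Entry<Long, Double> entry : similarItems.entrySet()) {
        Double rating = userRatings.get(entry.getKey());
        if (rating != null) {                  // the user has rated this similar item
          weightedSum += entry.getValue() * rating;
          similaritySum += Math.abs(entry.getValue());
        }
      }
      // similarity-weighted average of the user's ratings of the similar items
      return similaritySum == 0.0 ? Double.NaN : weightedSum / similaritySum;
    }
  }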
Regarding point 1) --maxCooccurrencesPerItem
In order to compute the item-item similarities, a naive approach would
have to consider all possible pairs of items, which has quadratic
complexity and obviously won't scale.
RowSimilarityJob, which is at the heart of both RecommenderJob and
ItemSimilarityJob, ensures that only pairs of items that have been
co-rated at least once are taken into consideration, which helps a lot in
recommendation use cases as most users have rated only a very small
number of items.
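Just to illustrate the principle (this is not the actual RowSimilarityJob
code, only a rough in-memory sketch with made-up names): pairs are only ever
formed among the items in a single user's ratings, so a pair of items that no
user has co-rated is never generated at all:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class CooccurrenceSketch {

    /** ratingsPerUser maps userID -> itemIDs rated by that user
     *  (a hypothetical in-memory stand-in for the distributed data). */
    static Map<String, Integer> countCooccurrences(Map<Long, List<Long>> ratingsPerUser) {
      Map<String, Integer> pairCounts = new HashMap<String, Integer>();
      for (List<Long> items : ratingsPerUser.values()) {
        // only pairs of items rated by the same user are ever considered
        for (int i = 0; i < items.size(); i++) {
          for (int j = i + 1; j < items.size(); j++) {
            long a = Math.min(items.get(i), items.get(j));
            long b = Math.max(items.get(i), items.get(j));
            String pair = a + "," + b;
            Integer count = pairCounts.get(pair);
            pairCounts.put(pair, count == null ? 1 : count + 1);
          }
        }
      }
      return pairCounts;
    }
  }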
However, if you look at the distribution of the number of ratings per
user or per item, it will usually follow a heavy-tailed distribution,
which means that there is a small number of items ("topsellers") with an
exorbitant number of ratings, as well as a small number of users
("powerusers") that show the same behavior.
These powerusers and topsellers might slow down the similarity
computation by orders of magnitude (as all pairs of items they have
co-rated have to be considered, which still grows quadratically) without
providing much additional insight. I think Ted wrote a mail to this
list some time ago where he confirmed this observation from his experience.
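To put a number on that quadratic growth: a user with n rated items
contributes n*(n-1)/2 co-rated pairs, so

  n = 10      ->           45 pairs
  n = 1,000   ->      499,500 pairs
  n = 10,000  ->   49,995,000 pairs

which is why a handful of powerusers can easily dominate the runtime of the
similarity computation.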
So we need some way to down-sample these ratings. This is done in
MaybePruneRowsMapper with a very simple heuristic controlled by
--maxCooccurrencesPerItem: it only looks at the portion of the data
available to that single mapper instance and might throw away ratings
for very frequently rated items.
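Very roughly, the kind of heuristic involved looks like the following sketch
(this is not MaybePruneRowsMapper's actual code, just an illustration of a
local frequency cap; all names besides the parameter are made up):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Random;

  public class PruningSketch {

    private final int maxCooccurrencesPerItem;   // value of --maxCooccurrencesPerItem
    private final Map<Long, Integer> seenSoFar = new HashMap<Long, Integer>();
    private final Random random = new Random();

    PruningSketch(int maxCooccurrencesPerItem) {
      this.maxCooccurrencesPerItem = maxCooccurrencesPerItem;
    }

    /** Decide whether to keep one more rating for an item, based only on the
     *  portion of the data this instance has seen so far. Purely illustrative. */
    boolean keepRating(long itemID) {
      Integer previous = seenSoFar.get(itemID);
      int seen = previous == null ? 1 : previous + 1;
      seenSoFar.put(itemID, seen);
      if (seen <= maxCooccurrencesPerItem) {
        return true;                             // infrequent items are always kept
      }
      // the item is already very frequent in this slice of the data: keep a
      // random subsample so that roughly maxCooccurrencesPerItem ratings survive
      return random.nextInt(seen) < maxCooccurrencesPerItem;
    }
  }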
I think this is a point where a lot of optimization is possible; Mahout
should provide support for customizable sampling strategies here, like
looking only at the x latest ratings of a user, for example (a sketch of
that idea follows below).
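Such a strategy could look roughly like this (purely hypothetical, nothing
like this exists in Mahout right now; the timestamp field is an assumption,
as the current input format only carries user, item and preference value):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  public class LatestRatingsSampler {

    /** One rating event of a user; purely illustrative. */
    static class Rating {
      final long itemID;
      final double value;
      final long timestamp;
      Rating(long itemID, double value, long timestamp) {
        this.itemID = itemID;
        this.value = value;
        this.timestamp = timestamp;
      }
    }

    /** Keep only the x most recent ratings of a single user. */
    static List<Rating> sample(List<Rating> ratingsOfOneUser, int x) {
      List<Rating> sorted = new ArrayList<Rating>(ratingsOfOneUser);
      Collections.sort(sorted, new Comparator<Rating>() {
        public int compare(Rating a, Rating b) {
          return Long.compare(b.timestamp, a.timestamp);   // newest first
        }
      });
      return sorted.subList(0, Math.min(x, sorted.size()));
    }
  }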
--sebastian
[1] Sarwar et al., "Item-Based Collaborative Filtering Recommendation Algorithms"
http://portal.acm.org/citation.cfm?id=372071
On 14.07.2011 16:11, Kris Jack wrote:
Hello,
I'm trying to get a better understanding of the following 2 RecommenderJob
parameters:
1) --maxCooccurrencesPerItem (integer): Maximum number of cooccurrences
considered per item (100)
2) --maxSimilaritiesPerItem (integer): Maximum number of similarities
considered per item (100)
Could you please help me to understand these in terms of a recommender job
where we are trying to recommend items to users?
From what I see, maxCooccurrencesPerItem first gets used in job 4/12 in the
pipeline, the MaybePruneRowsMapper job. Does maxCooccurrencesPerItem limit
the number of cooccurrences that are kept for that item? Is this limit
within a single user's set of items or globally for all users? For example,
if a user has 100 items then each item can be seen to cooccur with the 99
other items. Taking all user libraries, however, assume that it cooccurs
with 1,000,000 other items. Does maxCooccurrencesPerItem limit the number
of cooccurrences on a per-user item-set basis, or is it applied to the set of
items with which the item cooccurs across all user libraries? Also,
how is the selection made (most frequent or first found)?
maxSimilaritiesPerItem first gets used in job 7/12 in the pipeline,
EntriesToVectorsReducer. Does this cap the number of rows that are compared
with one another? Are the rows cooccurrence vectors of items for a given
user by this point in the process?
Thanks,
Kris