I recently published an article on my blog at http://ssc.io that deals with scaling recommender systems. I'm sure it has some ideas you could adapt.

--sebastian

On 20.05.2011 at 20:02, "Ted Dunning" <[email protected]> wrote:

> Sean will be able to address scaling and configuration better than I, but I
> have built video recommendation systems before and found that
>
> a) ratings are nearly worthless, largely because so few people will rate
> things
>
> b) the best preference data we ever found was whether the user viewed the
> asset longer than 30 seconds. This is a binary preference and it helps to
> have it that way since you can make use of a number of economies.
>
> c) some randomization in recommendations is very important so that you
> preserve some exploratory behavior. I implemented this by adding small
> amounts of noise to recommendation scores to perturb the ranking.
>
> On Fri, May 20, 2011 at 10:31 AM, Varnit Khanna <[email protected]> wrote:
>
>> Hi,
>> I have been considering using mahout for our recommendation engine
>> needs and had a couple of questions about using it in production.
>>
>> Use Case:
>> We need to provide recommendations on video assets (similar to hulu) to
>> a couple of million users and we have over 100K assets. Since we are
>> experiencing growth both in users and assets I am planning to use
>> mahout on hadoop.
>>
>> Preference Data:
>> Currently we do not have a ratings system built into our video
>> player/page but we do have logs on user impressions on video assets
>> which I will be feeding into RecommenderJob. Until we build a ratings
>> system I am planning on using the following preference data:
>>
>> Impressions | Rating
>> 1           | (empty)
>> 2           | 2
>> 3           | 3
>> 4           | 4
>> >=5         | 5
>>
>> Does this preference data make sense? I will be using the standard
>> RecommenderJob to generate recommendations until I get a better
>> understanding of mahout.
>>
>> Questions:
>> 1) What will be the best approach to deal with cold start on new
>> assets and users?
>> 2) Is it typical to parse the entire dataset in production to generate
>> recommendations for new assets and users or can it be done
>> incrementally?
>> 3) What is a better approach for this use case: item-based or user-based CF?
>> Also at some point in the future we would like to generate
>> recommendations on news assets so a single system might be beneficial.
>>
>> Thanks
>> -varnit
>>
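
For what it's worth, points (b) and (c) from Ted's mail quoted above could be sketched roughly like this in Python. This is only an illustration, not Mahout code and not necessarily how Ted implemented it: the function names, the (user, asset, seconds) log format, the Gaussian noise and the noise_scale value are all assumptions of mine. Only the 30-second threshold and the idea of perturbing scores with small noise come from the thread.

```python
import random


def binary_preferences(view_events, min_seconds=30.0):
    """Turn view logs into binary preferences.

    view_events: iterable of (user_id, asset_id, seconds_viewed) tuples
    (a hypothetical log format). A pair counts as a preference only if
    the user watched the asset longer than min_seconds.
    """
    return {(user, asset)
            for user, asset, seconds in view_events
            if seconds > min_seconds}


def perturb_ranking(scored_items, noise_scale=0.05, rng=None):
    """Re-rank items after adding small random noise to each score.

    scored_items: list of (item_id, score) pairs.
    noise_scale: standard deviation of the Gaussian noise (an assumed
    choice; the thread does not say which distribution was used).
    """
    if rng is None:
        rng = random.Random()
    noisy = [(item, score + rng.gauss(0.0, noise_scale))
             for item, score in scored_items]
    # Highest noisy score first.
    noisy.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in noisy]


if __name__ == "__main__":
    events = [("u1", "a1", 45.0), ("u1", "a2", 10.0), ("u2", "a1", 30.0)]
    print(binary_preferences(events))

    scores = [("a1", 0.90), ("a2", 0.88), ("a3", 0.40)]
    print(perturb_ranking(scores, noise_scale=0.05))
```

With a small noise_scale, items with nearly equal scores occasionally swap places while clearly weaker items stay at the bottom, which is one way to keep some exploratory behavior without wrecking the ranking.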
