The number of interactions is about 30 million a day.

>> The most important parameter (performance-wise) is the newly introduced
>> --maxPrefsPerUserInItemSimilarity

Will use it and keep you posted.

>> broadcasts the similarity matrix via the distributed cache

This can be done, as the similarity matrix is unlikely to blow up (the number of items does not grow exponentially).

>> I'd definitely recommend that you try out the current trunk of Mahout

I have built Mahout from the latest svn checkout: mahout-core-0.6-SNAPSHOT.jar

On Tue, Oct 25, 2011 at 10:42 AM, Sebastian Schelter <[email protected]> wrote:
> Hello Vishal,
>
> How many interactions do you have between those users and items? I'd
> definitely recommend that you try out the current trunk of Mahout, as the
> performance of RecommenderJob has been significantly improved.
>
> The most important parameter (performance-wise) is the newly introduced
> --maxPrefsPerUserInItemSimilarity, which causes RecommenderJob to
> downsample "power" users that can slow down the recommendation
> computation (without contributing much to the quality of the results).
>
> I'm currently running tests with a patched version of the new
> RecommenderJob on a Yahoo Music dataset consisting of more than 700
> million ratings from 2 million users over 140 thousand items (which
> seems similar to your user/item ratio) and am seeing nice results,
> even though I run it on a small research cluster.
>
> If the phase after the item similarity computation takes too long (I
> think you suspected this), then you can also try the patch from
> https://issues.apache.org/jira/browse/MAHOUT-827, which broadcasts the
> similarity matrix via the distributed cache and computes the
> recommendations in a map-only job. This could work well for your use
> case, as you have a relatively small number of items.
>
> --sebastian
>
>
> On 25.10.2011 16:27, Vishal Santoshi wrote:
> > The data is big, as in, for a single day (and I picked an arbitrary day):
> >
> > 8,335,013 users.
> > 256,010 distinct Items.
> >
> > I am using the Item Based Recommender (the RecommenderJob), with no
> > preference values (an opt-in is the signal of preference; multiple
> > opt-ins are counted as 1).
> >
> > <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> > <arg>recommender</arg>
> > <arg>--input</arg>
> > <arg>${out}/items/bag</arg>
> > <arg>--output</arg>
> > <arg>${out}/items_similarity</arg>
> > <arg>-u</arg>
> > <arg>${out}/items/users/part-r-00000</arg>
> > <arg>-b</arg>
> > <arg>-n</arg>
> > <arg>2</arg>
> > <arg>--similarityClassname</arg>
> > <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
> > <arg>--tempDir</arg>
> > <arg>${out}/temp</arg>
> >
> > Of course the recommendations are for every user, and thus the
> > RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer is
> > the most expensive step of all.
> > Further, I am not sure why the user file is taken in as a distributed
> > file, especially when it may actually be bigger than a typical
> > TaskTracker JVM memory limit.
> >
> >
> > In the case of MinHash, the MinHashDriver:
> >
> > <java>
> >     <job-tracker>${jobTracker}</job-tracker>
> >     <name-node>${nameNode}</name-node>
> >     <prepare>
> >         <delete path="${out}/minhash"/>
> >     </prepare>
> >     <configuration>
> >         <property>
> >             <name>mapred.job.queue.name</name>
> >             <value>${queueName}</value>
> >         </property>
> >     </configuration>
> >     <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> >     <arg>minhash_local</arg>
> >     <arg>--input</arg>
> >     <arg>${out}/bag</arg>
> >     <arg>--output</arg>
> >     <arg>${out}/minhash</arg>
> >     <arg>--keyGroups</arg> <!-- Key Groups -->
> >     <arg>2</arg>
> >     <arg>-r</arg> <!-- Number of Reducers -->
> >     <arg>40</arg>
> >     <arg>--minClusterSize</arg> <!-- A legitimate cluster must have this number of members -->
> >     <arg>5</arg>
> >     <arg>--hashType</arg> <!-- murmur and linear are the other 2 options -->
> >     <arg>polynomial</arg>
> > </java>
> >
> > This of course scales. I still have to work with the clusters created,
> > and a fair amount of work has to be done to figure out which cluster
> > is relevant.
> >
> > A week of data in this case created the MinHash on our cluster in
> > about 20 minutes.
> >
> > Regards.
> >
> >
> > On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
> >
> >> Can you put any more numbers around this? How slow is slow, how big is big?
> >> What part of Mahout are you using -- or are you using Mahout?
> >>
> >> Item-based recommendation sounds fine. Anonymous users aren't a
> >> problem as long as you can distinguish them reasonably.
> >> I think your challenge is to have a data model that quickly drops out
> >> data from old items and can bring new items in.
> >>
> >> Is this small enough to do in memory? That's the simple, easy place
> >> to start.
> >>
> >> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
> >> <[email protected]> wrote:
> >>> Hello Folks,
> >>> Item-based recommendation for my dataset is excruciatingly slow on
> >>> an 8-node cluster. Yes, the number of items is big, and the dataset
> >>> churn does not allow for a long asynchronous process. Recommendations
> >>> cannot be stale (a 30-minute delay is stale). I have tried out
> >>> MinHash clustering, and that is scalable, but without a "degree of
> >>> association" with the multiple clusters any user may belong to, it
> >>> seems less tight than a pure item-based (and thus
> >>> similarity-probability) algorithm.
> >>>
> >>> Any ideas how we pull this off, where:
> >>>
> >>> * Item churn is frequent. New items enter the dataset all the time.
> >>> * There is no "preference" apart from opt-in.
> >>> * Anonymous users enter the system very frequently.
> >>>
> >>> Scale is very important.
> >>>
> >>> I am tending towards MinHash with additional algorithms that are
> >>> executed offline, plus co-occurrence.
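For readers following the thread: the RecommenderJob config above uses TanimotoCoefficientSimilarity over binary opt-in data. A minimal sketch of what that measure computes, in Python rather than Mahout's Java (the function and variable names here are illustrative, not Mahout's): each item is the set of users who opted in to it, and the similarity is the Jaccard coefficient of the two sets.

```python
def tanimoto_similarity(users_a: set, users_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two items, each represented
    by the set of users who opted in to it: |A & B| / |A | B|."""
    if not users_a and not users_b:
        return 0.0
    intersection = len(users_a & users_b)
    union = len(users_a) + len(users_b) - intersection
    return intersection / union

# Example: two items with overlapping opt-in audiences.
item_x = {"u1", "u2", "u3", "u4"}
item_y = {"u3", "u4", "u5"}
print(tanimoto_similarity(item_x, item_y))  # 2 shared users / 5 total = 0.4
```

Since opt-ins carry no ratings (multiple opt-ins count as 1), this set-based measure is a natural fit: it rewards audience overlap without needing preference values.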
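Sebastian's --maxPrefsPerUserInItemSimilarity flag downsamples "power" users before the pairwise similarity computation, since per-user cost grows quadratically with the number of preferences. A rough sketch of the idea (not Mahout's actual implementation; the function name and cap are hypothetical):

```python
import random

def downsample_prefs(prefs_by_user: dict, max_prefs_per_user: int, seed: int = 42) -> dict:
    """Cap each user's preference list at max_prefs_per_user by random
    sampling, so power users don't dominate the item-item similarity
    job (a user with p prefs contributes p*(p-1)/2 item pairs)."""
    rng = random.Random(seed)
    sampled = {}
    for user, items in prefs_by_user.items():
        items = list(items)
        if len(items) <= max_prefs_per_user:
            sampled[user] = items
        else:
            sampled[user] = rng.sample(items, max_prefs_per_user)
    return sampled

prefs = {"casual": ["i1", "i2"], "power": [f"i{n}" for n in range(1000)]}
capped = downsample_prefs(prefs, max_prefs_per_user=100)
print(len(capped["casual"]), len(capped["power"]))  # 2 100
```

With 30 million interactions a day, capping the heaviest users trades a small amount of signal for a large reduction in the number of item pairs emitted.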
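The MAHOUT-827 patch Sebastian mentions broadcasts the item-item similarity matrix via the distributed cache and computes recommendations in a map-only job. A toy sketch of the map-side scoring step, assuming the matrix fits in each mapper's memory (plausible here, with only ~256k items); the data structures are simplified stand-ins, not the patch's actual code:

```python
def recommend(user_items: set, similarity: dict, top_n: int = 2) -> list:
    """Map-side scoring: with the item-item similarity matrix held in
    memory, score every unseen item by summing its similarity to the
    items the user already opted in to, then keep the top-N."""
    scores = {}
    for seen in user_items:
        for candidate, sim in similarity.get(seen, {}).items():
            if candidate not in user_items:
                scores[candidate] = scores.get(candidate, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy symmetric similarity matrix (e.g. Tanimoto scores).
sim = {
    "a": {"b": 0.4, "c": 0.1},
    "b": {"a": 0.4, "c": 0.3},
    "c": {"a": 0.1, "b": 0.3},
}
print(recommend({"a"}, sim))  # ['b', 'c']
```

Avoiding the shuffle and reduce phase is exactly what makes this attractive when the post-similarity phase (PartialMultiplyMapper / AggregateAndRecommendReducer) is the bottleneck, as suspected earlier in the thread.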
