Why recommend for all users -- why not just the new ones, or the ones whose
prefs have been updated? And yes, the "-u" users file isn't intended to list
all users (and so be loaded into memory); it's meant to hold a subset.
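As a rough sketch of what that could look like -- assuming some upstream step
writes the IDs of users whose prefs changed since the last run to a file; the
wrapper class and paths below are made up for illustration, and the long-form
flags correspond to the -u/-b/-n short forms in the config further down:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    // Illustrative only: drive RecommenderJob against just the changed
    // users instead of all 8.3M. changed-users.txt holds one user ID per
    // line and is assumed to be produced by an upstream job.
    public class IncrementalRecommenderRun {
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new RecommenderJob(), new String[] {
            "--input", "/data/items/bag",             // assumed pref data
            "--output", "/data/recs",                 // assumed output dir
            "--usersFile", "/data/changed-users.txt", // long form of -u
            "--booleanData",                          // long form of -b
            "--numRecommendations", "2",              // long form of -n
            "--similarityClassname",
            "org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity",
            "--tempDir", "/data/temp"
        });
      }
    }

The idea is that --usersFile limits the expensive recommendation phase to the
subset of users that actually needs fresh recs.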
A very crude rule of thumb is that you can compute about 100 recs per
second on a normal machine, with normal-sized data (no Hadoop). 8 machines
would crank through 8.3M recs in about 3 hours at best (8,335,013 recs /
(8 machines x 100 recs/sec) is roughly 10,400 seconds, or about 2.9 hours).
Hadoop is going to be 3-4x slower than this due to its overheads. This
pipeline probably takes 10 minutes or so to finish even with 0 input;
that's the Hadoop overhead. If you're trying to finish computations in
minutes, Hadoop probably isn't suitable.

But I think this all works much, much better if you can recompute only the
users that have changed their prefs.

On Tue, Oct 25, 2011 at 3:27 PM, Vishal Santoshi <[email protected]> wrote:

> The data is big; for a single day (and I picked an arbitrary day):
>
> 8,335,013 users.
> 256,010 distinct items.
>
> I am using the item-based recommender (the RecommenderJob), with no
> preference values (an opt-in is a signal of preference; multiple opt-ins
> are considered 1).
>
> <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> <arg>recommender</arg>
> <arg>--input</arg>
> <arg>${out}/items/bag</arg>
> <arg>--output</arg>
> <arg>${out}/items_similarity</arg>
> <arg>-u</arg>
> <arg>${out}/items/users/part-r-00000</arg>
> <arg>-b</arg>
> <arg>-n</arg>
> <arg>2</arg>
> <arg>--similarityClassname</arg>
> <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
> <arg>--tempDir</arg>
> <arg>${out}/temp</arg>
>
> Of course the recommendations are for every user, and thus the
> RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer is the
> most expensive step of all.
> Further, I'm not sure why the user file is taken in as a distributed file,
> especially when it may actually be bigger than a typical TaskTracker JVM
> memory limit.
>
> In the case of MinHash, the MinHashDriver:
>
> <java>
>     <job-tracker>${jobTracker}</job-tracker>
>     <name-node>${nameNode}</name-node>
>     <prepare>
>         <delete path="${out}/minhash"/>
>     </prepare>
>     <configuration>
>         <property>
>             <name>mapred.job.queue.name</name>
>             <value>${queueName}</value>
>         </property>
>     </configuration>
>     <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
>     <arg>minhash_local</arg>
>     <arg>--input</arg>
>     <arg>${out}/bag</arg>
>     <arg>--output</arg>
>     <arg>${out}/minhash</arg>
>     <arg>--keyGroups</arg>        <!-- key groups -->
>     <arg>2</arg>
>     <arg>-r</arg>                 <!-- number of reducers -->
>     <arg>40</arg>
>     <arg>--minClusterSize</arg>   <!-- a legitimate cluster must have this many members -->
>     <arg>5</arg>
>     <arg>--hashType</arg>         <!-- murmur and linear are the other 2 options -->
>     <arg>polynomial</arg>
> </java>
>
> This of course scales. I still have to work with the clusters created,
> and a fair amount of work has to be done to figure out which cluster is
> relevant.
>
> A week of data in this case built the MinHash clusters on our cluster in
> about 20 minutes.
>
> Regards.
>
> On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
>
>> Can you put any more numbers around this? How slow is slow, how big is
>> big? What part of Mahout are you using -- or are you using Mahout?
>>
>> Item-based recommendation sounds fine. Anonymous users aren't a
>> problem as long as you can distinguish them reasonably.
>> I think your challenge is to have a data model that quickly drops out
>> data from old items and can bring new items in.
>>
>> Is this small enough to do in memory? That's the simple, easy place to
>> start.
>>
>> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
>> <[email protected]> wrote:
>>
>> > Hello Folks,
>> > The item-based recommendations for my dataset are excruciatingly slow
>> > on an 8-node cluster. Yes, the number of items is big, and the dataset
>> > churn does not allow for a long asynchronous process. Recommendations
>> > cannot be stale (a 30-minute delay is stale). I have tried out MinHash
>> > clustering, and that is scalable, but without a "degree of association"
>> > with the multiple clusters any user may belong to, it seems less tight
>> > than a pure item-based (and thus similarity-probability) algorithm.
>> >
>> > Any ideas how we pull this off, where:
>> >
>> > * The item churn is frequent. New items enter the dataset all the time.
>> > * There is no "preference" apart from opt-in.
>> > * Very frequent anonymous users enter the system almost all the time.
>> >
>> > Scale is very important.
>> >
>> > I am tending towards MinHash with additional algorithms that are
>> > executed offline, plus co-occurrence.
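[A note on the "degree of association" point raised above: the fraction of
matching minhash slots between two signatures is an unbiased estimate of the
Jaccard similarity of the underlying item sets, which could serve as exactly
that degree. A self-contained sketch under that assumption -- this is not
Mahout's MinHashDriver, and the hash scheme and names are illustrative; it
assumes nonnegative IDs and seeds well below 2^31:]

    import java.util.Arrays;
    import java.util.Set;

    // Illustrative only: minhash signatures over a user's item set, using
    // linear hashes h_i(x) = (a_i * x + b_i) mod p.
    public class MinHashAssociation {
      private static final long PRIME = 2147483647L; // 2^31 - 1

      // One signature slot per (a, b) seed pair.
      static long[] signature(Set<Long> items, long[] a, long[] b) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (long item : items) {
          for (int i = 0; i < a.length; i++) {
            long h = (a[i] * item + b[i]) % PRIME;
            sig[i] = Math.min(sig[i], h);
          }
        }
        return sig;
      }

      // Fraction of matching slots estimates Jaccard similarity between the
      // underlying item sets -- usable as a degree of association between a
      // user and a cluster representative.
      static double estimatedJaccard(long[] sigA, long[] sigB) {
        int matches = 0;
        for (int i = 0; i < sigA.length; i++) {
          if (sigA[i] == sigB[i]) matches++;
        }
        return (double) matches / sigA.length;
      }
    }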

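[And for reference on the TanimotoCoefficientSimilarity used in the config
above: with boolean opt-in data it reduces to |A intersect B| / |A union B|
over the sets of users who opted into each of the two items. A tiny
illustrative sketch of that reduction -- not the Mahout class itself:]

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: Tanimoto (Jaccard) similarity between two items,
    // each represented by the set of user IDs that opted into it.
    public class TanimotoSketch {
      static double tanimoto(Set<Long> usersA, Set<Long> usersB) {
        Set<Long> common = new HashSet<>(usersA);
        common.retainAll(usersB);
        int union = usersA.size() + usersB.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
      }

      public static void main(String[] args) {
        Set<Long> a = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> b = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println(tanimoto(a, b)); // 2 shared / 4 in union = 0.5
      }
    }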