>> But, I think this all works much much better if you can only recompute
>> users that have changed their prefs.
In our case a preference is a user clicking on an article ( which doubles as an item ), and these articles are introduced at a frequent rate. Thus the set of items in the dataset churns very frequently, and new items do not necessarily have any history. Of course we need to recommend the latest items. So the issues are:

* We have users with a historical click history.
* We have new items that users will potentially click on.
* Brand new users may have to be checked against users that have a history ( to find similarity ).
* Recommendations on old items, though OK, have staleness associated with them.

Unlike Amazon or Netflix, staleness, churn etc. are a real big deal for us. I realize the overhead of Hadoop, yet the data is cumulative, as in we would rather go for a sliding window of 3 weeks. We do want to recompute every 3-4 hours for every user ( a user may come back at any time ). We do realize that the offline part of the computation will likely be a part of the solution.

What would you do based on our requirements? For me:

* Offline clustering ( PLSI + MinHash ) + item-based recommendation.
* Co-occurrence on items ( for new users who have no history ).

Rough, untested sketches of both ideas are at the bottom, after the quoted thread.

On Tue, Oct 25, 2011 at 10:43 AM, Sean Owen <[email protected]> wrote:
> Why recommend for all users -- why not just new ones or ones that have
> been updated? Yes, you're not intended to list all users into memory
> if using "-u".
>
> A very crude rule of thumb is that you can compute about 100 recs per
> second on a normal machine, normal-sized data (no Hadoop). 8 machines
> would crank through 8.3M recs in 3 hours at best. Hadoop is going to
> be 3-4x slower than this due to its overheads.
>
> This pipeline probably takes 10 minutes or so to finish even with 0
> input; that's the Hadoop overhead. If you're trying to finish
> computations in minutes, Hadoop probably isn't suitable.
>
> But, I think this all works much much better if you can only recompute
> users that have changed their prefs.
>
>
> On Tue, Oct 25, 2011 at 3:27 PM, Vishal Santoshi
> <[email protected]> wrote:
> > The data is big as in for a single day ( and I picked up an arbitrary day )
> >
> > 8,335,013 users.
> > 256,010 distinct Items.
> >
> > I am using the Item Based Recommender ( The RecommenderJob ) , with no
> > Preference ( opt in is a signal of preference , multiple opt ins are
> > considered 1 )
> >
> > <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> > <arg>recommender</arg>
> > <arg>--input</arg>
> > <arg>${out}/items/bag</arg>
> > <arg>--output</arg>
> > <arg>${out}/items_similarity</arg>
> > <arg>-u</arg>
> > <arg>${out}/items/users/part-r-00000</arg>
> > <arg>-b</arg>
> > <arg>-n</arg>
> > <arg>2</arg>
> > <arg>--similarityClassname</arg>
> >
> > <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
> > <arg>--tempDir</arg>
> > <arg>${out}/temp</arg>
> >
> > Of course the Recommendations are for every user and thus
> > the RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer is the
> > most expensive of all.
> > Further , not sure why the user file is taken in as a Distributed File
> > especially when it may actually be a bigger file that a typical TaskTracker
> > JVM memory limit.
> >
> >
> > In case of MinHash , MinHashDriver
> >
> > <java>
> > <job-tracker>${jobTracker}</job-tracker>
> > <name-node>${nameNode}</name-node>
> > <prepare>
> > <delete path="${out}/minhash"/>
> > </prepare>
> > <configuration>
> > <property>
> > <name>mapred.job.queue.name</name>
> > <value>${queueName}</value>
> > </property>
> > </configuration>
> > <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> > <arg>minhash_local</arg>
> > <arg>--input</arg>
> > <arg>${out}/bag</arg>
> > <arg>--output</arg>
> > <arg>${out}/minhash</arg>
> > <arg>--keyGroups</arg> <!-- Key Groups -->
> > <arg>2</arg>
> > <arg>-r</arg> <!-- Number of Reducers -->
> > <arg>40</arg>
> > <arg>--minClusterSize</arg> <!-- A legitimate cluster must have
> > this number of members -->
> > <arg>5</arg>
> > <arg>--hashType</arg> <!-- murmur and linear are the other 2
> > options -->
> > <arg>polynomial</arg>
> > </java>
> >
> > This of course scales. I still have to work with the clusters created and a
> > fair amount of work has to be done to figure out which cluster is relevant.
> >
> >
> > A week of data in this case created the MinHash on our cluster in about 20
> > minutes.
> >
> >
> > Regards.
> >
> >
> > On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
> >
> >> Can you put any more numbers around this? how slow is slow, how big is big?
> >> What part of Mahout are you using -- or are you using Mahout?
> >>
> >> Item-based recommendation sounds fine. Anonymous users aren't a
> >> problem as long as you can distinguish them reasonably.
> >> I think your challenge is to have a data model that quickly drops out
> >> data from old items and can bring new items in.
> >>
> >> Is this small enough to do in memory? that's the simple, easy place to
> >> start.
> >>
> >> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
> >> <[email protected]> wrote:
> >> > Hello Folks,
> >> > The Item Based Recommendations for my dataset is
> >> > excruciatingly slow on a 8 node cluster. Yes the number of items is big and
> >> > the dataset churn does not allow for a long asynchronous process.
> >> > Recommendations cannot be stale ( a 30 minute delay is stale ). I have tried
> >> > out MinHash clustering and that is scalable, but without a "degree of
> >> > association" with multiple clusters any user may belong to , it seems less
> >> > tight that pure item based ( and thus similarity probability ) algorithm.
> >> >
> >> > Any ideas how we pull this off., where
> >> >
> >> > * The item churn is frequent. New items enter the dataset all the time.
> >> > * There is no "preference" apart from opt in.
> >> > * Very frequent anonymous users enter the system almost all the time.
> >> >
> >> >
> >> > Scale is very important.
> >> >
> >> > I am tending towards MinHash with additional algorithms that are executed
> >> > offline and co occurance.
> >> >
> >> >
> >
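
Sketch 1 -- to make the "in memory, only recompute changed users" idea concrete, this is roughly what we would try for users that do have history. It is an untested sketch over the 3-week window using the Taste classes; the CSV path and the changedUserIds array are placeholders for whatever our sliding-window extract produces.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class IncrementalItemRecs {
      public static void main(String[] args) throws Exception {
        // Untested sketch: "clicks-3weeks.csv" is a placeholder file of
        // "userID,itemID" lines (opt-in only, so boolean prefs, no values).
        DataModel model = new FileDataModel(new File("clicks-3weeks.csv"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        ItemBasedRecommender recommender =
            new GenericBooleanPrefItemBasedRecommender(model, similarity);

        // Placeholder: only the users whose prefs changed since the last run.
        long[] changedUserIds = {12345L, 67890L};
        for (long userId : changedUserIds) {
          List<RecommendedItem> recs = recommender.recommend(userId, 2);
          System.out.println(userId + " -> " + recs);
        }
      }
    }

This assumes the 3-week window actually fits in memory, per Sean's earlier question; if it does not, scoping the run to changed users should still cut the Hadoop job down considerably.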
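
Sketch 2 -- for the brand new / anonymous users, the co-occurrence idea boils down to something like this ( again rough and untested ): the counts are built offline from the same click window, and only the cheap lookup happens at request time for whatever the session has clicked so far.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class CooccurrenceRecs {

      // itemA -> ( itemB -> number of users who clicked both ); built offline.
      private final Map<Long, Map<Long, Integer>> cooccurrence =
          new HashMap<Long, Map<Long, Integer>>();

      // Offline pass: feed each user's distinct clicked items once.
      public void addUserClicks(Set<Long> items) {
        for (Long a : items) {
          Map<Long, Integer> row = cooccurrence.get(a);
          if (row == null) {
            row = new HashMap<Long, Integer>();
            cooccurrence.put(a, row);
          }
          for (Long b : items) {
            if (a.equals(b)) continue;
            Integer count = row.get(b);
            row.put(b, count == null ? 1 : count + 1);
          }
        }
      }

      // Online lookup: score candidate items by summed co-occurrence
      // with the clicks seen in this (possibly anonymous) session.
      public List<Long> recommendForSession(Set<Long> sessionClicks, int howMany) {
        final Map<Long, Integer> scores = new HashMap<Long, Integer>();
        for (Long clicked : sessionClicks) {
          Map<Long, Integer> row = cooccurrence.get(clicked);
          if (row == null) continue;
          for (Map.Entry<Long, Integer> e : row.entrySet()) {
            if (sessionClicks.contains(e.getKey())) continue;
            Integer s = scores.get(e.getKey());
            scores.put(e.getKey(), s == null ? e.getValue() : s + e.getValue());
          }
        }
        List<Long> ranked = new ArrayList<Long>(scores.keySet());
        Collections.sort(ranked, new Comparator<Long>() {
          public int compare(Long x, Long y) {
            return scores.get(y) - scores.get(x);
          }
        });
        return ranked.subList(0, Math.min(howMany, ranked.size()));
      }
    }

The offline pass is the part that could move to Hadoop; the lookup side stays cheap, and a new item starts getting recommended as soon as it co-occurs with anything else in the window, which is the property we care about.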
