The data is big: for a single day (I picked an arbitrary day),
8,335,013 users.
256,010 distinct items.
I am using the item-based recommender (RecommenderJob) with Boolean
preferences (an opt-in is a signal of preference; multiple opt-ins are
counted as one):
<main-class>com.nytimes.computing.mahout.JobDriver</main-class>
<arg>recommender</arg>
<arg>--input</arg>
<arg>${out}/items/bag</arg>
<arg>--output</arg>
<arg>${out}/items_similarity</arg>
<arg>-u</arg>
<arg>${out}/items/users/part-r-00000</arg>
<arg>-b</arg>
<arg>-n</arg>
<arg>2</arg>
<arg>--similarityClassname</arg>
<arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
<arg>--tempDir</arg>
<arg>${out}/temp</arg>
Of course, recommendations are computed for every user, and thus the
RecommenderJob's PartialMultiplyMapper/AggregateAndRecommendReducer phase is
the most expensive of all.
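For reference, with Boolean (opt-in only) data the Tanimoto coefficient reduces to intersection-over-union of the two items' user sets. A minimal sketch of that idea (the class and helper names here are hypothetical, not Mahout's actual implementation):

```java
import java.util.HashSet;
import java.util.Set;

public class Tanimoto {
    // Tanimoto (Jaccard) similarity for Boolean preference data:
    // |A ∩ B| / |A ∪ B|, where A and B are the sets of users who
    // opted in to each of the two items.
    static double similarity(Set<Long> usersA, Set<Long> usersB) {
        Set<Long> intersection = new HashSet<>(usersA);
        intersection.retainAll(usersB);
        int union = usersA.size() + usersB.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        Set<Long> a = new HashSet<>(Set.of(1L, 2L, 3L, 4L));
        Set<Long> b = new HashSet<>(Set.of(3L, 4L, 5L));
        // intersection = {3, 4} (size 2), union size = 5
        System.out.println(similarity(a, b)); // 0.4
    }
}
```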
Further, I am not sure why the users file is taken in as a distributed-cache
file, especially when it may actually be larger than a typical TaskTracker
JVM's memory limit.
In the case of MinHash (MinHashDriver):
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${out}/minhash"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.nytimes.computing.mahout.JobDriver</main-class>
<arg>minhash_local</arg>
<arg>--input</arg>
<arg>${out}/bag</arg>
<arg>--output</arg>
<arg>${out}/minhash</arg>
<arg>--keyGroups</arg> <!-- Key Groups -->
<arg>2</arg>
<arg>-r</arg> <!-- Number of Reducers -->
<arg>40</arg>
<arg>--minClusterSize</arg> <!-- A legitimate cluster must have this number of members -->
<arg>5</arg>
<arg>--hashType</arg> <!-- murmur and linear are the other 2 options -->
<arg>polynomial</arg>
</java>
This, of course, scales. I still have to work with the clusters created, and
a fair amount of work remains to figure out which cluster is relevant. A week
of data, in this case, produced the MinHash clusters on our cluster in about
20 minutes.
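To make the --hashType polynomial option concrete, here is a sketch of the MinHash idea using polynomial hash functions of the form h_i(x) = (a_i * x + b_i) mod p. The class name and parameters are illustrative only, not Mahout's actual code:

```java
import java.util.Random;

public class MinHashSketch {
    static final long PRIME = 2147483647L; // 2^31 - 1

    // Compute a MinHash signature for a user's set of opted-in item IDs.
    // Users with similar item sets tend to get similar signatures, and
    // identical sets always get identical signatures, so the signature
    // (or groups of its components) can serve as a cluster key.
    static long[] signature(long[] itemIds, int numHashes, long seed) {
        Random rnd = new Random(seed);
        long[] sig = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long a = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            long b = rnd.nextInt(Integer.MAX_VALUE);
            long min = Long.MAX_VALUE;
            for (long x : itemIds) {
                long h = (a * x + b) % PRIME;
                if (h < min) min = h;
            }
            sig[i] = min;
        }
        return sig;
    }

    public static void main(String[] args) {
        long[] u1 = {10, 20, 30};
        long[] u2 = {10, 20, 30};
        // Identical opt-in sets hash to the same cluster key.
        System.out.println(java.util.Arrays.equals(
            signature(u1, 4, 42), signature(u2, 4, 42))); // true
    }
}
```

With --keyGroups 2, pairs of signature components are concatenated into the reducer key, which trades off cluster precision against recall.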
Regards.
On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
> Can you put any more numbers around this? how slow is slow, how big is big?
> What part of Mahout are you using -- or are you using Mahout?
>
> Item-based recommendation sounds fine. Anonymous users aren't a
> problem as long as you can distinguish them reasonably.
> I think your challenge is to have a data model that quickly drops out
> data from old items and can bring new items in.
>
> Is this small enough to do in memory? that's the simple, easy place to
> start.
>
> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
> <[email protected]> wrote:
> > Hello Folks,
> > The item-based recommendations for my dataset are
> > excruciatingly slow on an 8-node cluster. Yes, the number of items is big,
> > and the dataset churn does not allow for a long asynchronous process.
> > Recommendations cannot be stale (a 30-minute delay is stale). I have tried
> > out MinHash clustering, and that is scalable, but without a "degree of
> > association" with the multiple clusters a user may belong to, it seems
> > less tight than the pure item-based (and thus similarity-probability)
> > algorithm.
> >
> > Any ideas how we can pull this off, where
> >
> > * The item churn is frequent. New items enter the dataset all the time.
> > * There is no "preference" apart from opt in.
> > * Anonymous users enter the system almost all the time.
> >
> >
> > Scale is very important.
> >
> > I am tending towards MinHash, with additional algorithms executed
> > offline, plus co-occurrence.
> >
>