The data is big: for a single day (I picked an arbitrary day) there are

8,335,013 users
256,010 distinct items

I am using the item-based recommender (RecommenderJob), with no preference
values: an opt-in is the signal of preference, and multiple opt-ins by the
same user are counted as one.
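
For comparison with the in-memory route Sean suggests below, here is a minimal
non-distributed sketch of the same setup using Mahout's Taste API. The
optins.csv path and the user ID 12345 are placeholders, the input is assumed to
be the usual userID,itemID CSV, and whether 8.3M users fit in one JVM is
exactly the open question:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class InMemoryOptInRecommender {
        public static void main(String[] args) throws Exception {
            // Boolean-preference model: the presence of a (user, item) pair
            // is the preference; duplicates collapse to one, as with opt-ins.
            DataModel model = new GenericBooleanPrefDataModel(
                GenericBooleanPrefDataModel.toDataMap(
                    new FileDataModel(new File("optins.csv"))));
            // Tanimoto only looks at set overlap, so it suits boolean data.
            GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(
                    model, new TanimotoCoefficientSimilarity(model));
            // Top 2 recommendations, matching -n 2 in the Hadoop job below.
            List<RecommendedItem> recs = recommender.recommend(12345L, 2);
            for (RecommendedItem rec : recs) {
                System.out.println(rec.getItemID() + " " + rec.getValue());
            }
        }
    }

The distributed version, a RecommenderJob invocation wrapped in an Oozie java
action: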

            <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
            <arg>recommender</arg>
            <arg>--input</arg>
            <arg>${out}/items/bag</arg>
            <arg>--output</arg>
            <arg>${out}/items_similarity</arg>
            <arg>-u</arg>
            <arg>${out}/items/users/part-r-00000</arg>
            <arg>-b</arg>
            <arg>-n</arg>
            <arg>2</arg>
            <arg>--similarityClassname</arg>
            <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
            <arg>--tempDir</arg>
            <arg>${out}/temp</arg>
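
TanimotoCoefficientSimilarity is a natural fit for opt-in data: it scores a
pair of items purely by the overlap of the user sets that opted into them,
|A ∩ B| / |A ∪ B|. A toy illustration of the measure, with hypothetical
user-ID sets:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TanimotoDemo {

        // Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|.
        static double tanimoto(Set<Long> a, Set<Long> b) {
            Set<Long> intersection = new HashSet<Long>(a);
            intersection.retainAll(b);
            int union = a.size() + b.size() - intersection.size();
            return union == 0 ? 0.0 : (double) intersection.size() / union;
        }

        public static void main(String[] args) {
            // Users who opted into item A and item B, respectively.
            Set<Long> itemA = new HashSet<Long>(Arrays.asList(1L, 2L, 3L, 4L));
            Set<Long> itemB = new HashSet<Long>(Arrays.asList(3L, 4L, 5L));
            System.out.println(tanimoto(itemA, itemB)); // 2 / 5 = 0.4
        }
    }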

Of course the recommendations are computed for every user, and thus the
RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer job is the
most expensive of all.
Further, I am not sure why the users file is taken in through the distributed
cache, especially when it may actually be a bigger file than a typical
TaskTracker JVM memory limit.
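
One possible workaround (just a sketch of an idea, not anything Mahout
provides): pre-split the users file into shards and run RecommenderJob once
per shard with a smaller -u file, so that each mapper JVM only ever has to
hold one shard. A hypothetical sharder:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    // Hypothetical helper, not part of Mahout: splits a one-userID-per-line
    // users file into N shard files, keyed by userID modulo N.
    public class UsersFileSharder {
        public static void main(String[] args) throws IOException {
            String usersFile = args[0];
            int shards = Integer.parseInt(args[1]);
            BufferedWriter[] out = new BufferedWriter[shards];
            for (int i = 0; i < shards; i++) {
                out[i] = new BufferedWriter(
                    new FileWriter("users-shard-" + i + ".txt"));
            }
            BufferedReader in = new BufferedReader(new FileReader(usersFile));
            String line;
            while ((line = in.readLine()) != null) {
                long userID = Long.parseLong(line.trim());
                int shard = (int) (userID % shards); // assumes non-negative IDs
                out[shard].write(line);
                out[shard].newLine();
            }
            in.close();
            for (BufferedWriter w : out) {
                w.close();
            }
        }
    }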



In the case of MinHash (MinHashDriver):

        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${out}/minhash"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
            <arg>minhash_local</arg>
            <arg>--input</arg>
            <arg>${out}/bag</arg>
            <arg>--output</arg>
            <arg>${out}/minhash</arg>
            <arg>--keyGroups</arg> <!-- key groups -->
            <arg>2</arg>
            <arg>-r</arg> <!-- number of reducers -->
            <arg>40</arg>
            <arg>--minClusterSize</arg> <!-- a legitimate cluster must have this many members -->
            <arg>5</arg>
            <arg>--hashType</arg> <!-- murmur and linear are the other two options -->
            <arg>polynomial</arg>
        </java>
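
For intuition about what the job does: each user's item set is run through
several hash functions, the minimum hash value per function is kept, and the
minima are concatenated into a cluster key. For a single hash function, two
users share a minimum with probability equal to the Jaccard similarity of
their item sets; concatenating more minima makes the key stricter. A toy
sketch of that idea; the hash mixing below is illustrative, not Mahout's
polynomial, murmur, or linear families:

    import java.util.Arrays;
    import java.util.List;

    public class MinHashSketch {

        // One seed per hash function; the values are arbitrary odd constants.
        private static final long[] SEEDS = {
            0x9E3779B97F4A7C15L, 0xC2B2AE3D27D4EB4FL,
            0x165667B19E3779F9L, 0x27D4EB2F165667C5L
        };

        // Cheap illustrative mixer standing in for a real hash family.
        private static long mix(long itemID, long seed) {
            long h = itemID * seed + seed;
            return h ^ (h >>> 31);
        }

        // Concatenate the per-function minimum hash values into one key.
        static String clusterKey(List<Long> items) {
            StringBuilder key = new StringBuilder();
            for (long seed : SEEDS) {
                long min = Long.MAX_VALUE;
                for (long item : items) {
                    min = Math.min(min, mix(item, seed));
                }
                key.append(min).append('-');
            }
            return key.toString();
        }

        public static void main(String[] args) {
            // Two users with heavily overlapping opt-ins often share a key.
            System.out.println(clusterKey(Arrays.asList(1L, 2L, 3L, 4L)));
            System.out.println(clusterKey(Arrays.asList(1L, 2L, 3L, 5L)));
        }
    }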

The MinHash approach, of course, scales. I still have to work with the
clusters it creates, and a fair amount of work remains to figure out which
cluster is relevant.


A week of data, in this case, produced the MinHash clusters on our cluster in
about 20 minutes.


Regards.


On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:

> Can you put any more numbers around this? How slow is slow, how big is big?
> What part of Mahout are you using -- or are you using Mahout?
>
> Item-based recommendation sounds fine. Anonymous users aren't a
> problem as long as you can distinguish them reasonably.
> I think your challenge is to have a data model that quickly drops out
> data from old items and can bring new items in.
>
> Is this small enough to do in memory? That's the simple, easy place to
> start.
>
> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
> <[email protected]> wrote:
> > Hello Folks,
> >                  The item-based recommendations for my dataset are
> > excruciatingly slow on an 8-node cluster. Yes, the number of items is big,
> > and the dataset churn does not allow for a long asynchronous process.
> > Recommendations cannot be stale (a 30-minute delay is stale). I have tried
> > out MinHash clustering and that is scalable, but without a "degree of
> > association" with the multiple clusters any user may belong to, it seems
> > less tight than a pure item-based (and thus similarity-probability)
> > algorithm.
> >
> > Any ideas how we pull this off, where
> >
> > * The item churn is frequent. New items enter the dataset all the time.
> > * There is no "preference" apart from opt-in.
> > * Anonymous users enter the system very frequently.
> >
> >
> > Scale is very important.
> >
> > I am tending towards MinHash, with additional algorithms that are executed
> > offline, and co-occurrence.
> >
>
