The number of interactions is about 30 million a day.

>> The most important parameter (performance-wise) is the newly introduced
>> --maxPrefsPerUserInItemSimilarity

Will use it and keep you posted.

>> broadcasts the similarity matrix via the distributed cache

This can be done, as the similarity matrix is unlikely to blow up (the number of items does not grow exponentially).

>> I'd definitely recommend that you try out the current trunk of Mahout

I have built Mahout from the latest svn checkout: mahout-core-0.6-SNAPSHOT.jar

On Tue, Oct 25, 2011 at 10:42 AM, Sebastian Schelter <[email protected]> wrote:
> Hello Vishal,
>
> How many interactions do you have between those users and items? I'd
> definitely recommend that you try out the current trunk of Mahout, as the
> performance of RecommenderJob has been significantly improved.
>
> The most important parameter (performance-wise) is the newly introduced
> --maxPrefsPerUserInItemSimilarity, which causes RecommenderJob to
> downsample "power" users that can slow down the recommendation
> computation (without contributing much to the quality of the results).
>
> I'm currently running tests with a patched version of the new
> RecommenderJob on a Yahoo Music dataset consisting of more than 700
> million ratings from 2 million users over 140 thousand items (which
> seems similar to your user/item ratio) and am seeing nice results,
> even though I run it on a small research cluster.
>
> If the phase after the item similarity computation takes too long (I
> think you suspected this), then you can also try the patch from
> https://issues.apache.org/jira/browse/MAHOUT-827, which broadcasts the
> similarity matrix via the distributed cache and computes the
> recommendations in a map-only job. This could work well for your use
> case, as you have a relatively small number of items.
>
> --sebastian
>
>
> On 25.10.2011 16:27, Vishal Santoshi wrote:
> > The data is big, as in, for a single day (and I picked an arbitrary day):
> >
> > 8,335,013 users.
> > 256,010 distinct Items.
> >
> > I am using the Item Based Recommender (the RecommenderJob), with no
> > preference values (an opt-in is the signal of preference; multiple
> > opt-ins are counted as 1).
> >
> > <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> > <arg>recommender</arg>
> > <arg>--input</arg>
> > <arg>${out}/items/bag</arg>
> > <arg>--output</arg>
> > <arg>${out}/items_similarity</arg>
> > <arg>-u</arg>
> > <arg>${out}/items/users/part-r-00000</arg>
> > <arg>-b</arg>
> > <arg>-n</arg>
> > <arg>2</arg>
> > <arg>--similarityClassname</arg>
> > <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
> > <arg>--tempDir</arg>
> > <arg>${out}/temp</arg>
> >
> > Of course the recommendations are for every user, and thus the
> > RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer is
> > the most expensive step of all.
> > Further, I am not sure why the user file is taken in as a distributed
> > file, especially when it may actually be bigger than a typical
> > TaskTracker JVM memory limit.
> >
> >
> > In the case of MinHash, the MinHashDriver:
> >
> > <java>
> >     <job-tracker>${jobTracker}</job-tracker>
> >     <name-node>${nameNode}</name-node>
> >     <prepare>
> >         <delete path="${out}/minhash"/>
> >     </prepare>
> >     <configuration>
> >         <property>
> >             <name>mapred.job.queue.name</name>
> >             <value>${queueName}</value>
> >         </property>
> >     </configuration>
> >     <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> >     <arg>minhash_local</arg>
> >     <arg>--input</arg>
> >     <arg>${out}/bag</arg>
> >     <arg>--output</arg>
> >     <arg>${out}/minhash</arg>
> >     <arg>--keyGroups</arg> <!-- Key Groups -->
> >     <arg>2</arg>
> >     <arg>-r</arg> <!-- Number of Reducers -->
> >     <arg>40</arg>
> >     <arg>--minClusterSize</arg> <!-- A legitimate cluster must have this number of members -->
> >     <arg>5</arg>
> >     <arg>--hashType</arg> <!-- murmur and linear are the other 2 options -->
> >     <arg>polynomial</arg>
> > </java>
> >
> > This of course scales. I still have to work with the clusters created,
> > and a fair amount of work has to be done to figure out which cluster
> > is relevant.
> >
> > A week of data in this case created the MinHash on our cluster in
> > about 20 minutes.
> >
> > Regards.
> >
> >
> > On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
> >
> >> Can you put any more numbers around this? How slow is slow, how big is big?
> >> What part of Mahout are you using -- or are you using Mahout?
> >>
> >> Item-based recommendation sounds fine. Anonymous users aren't a
> >> problem as long as you can distinguish them reasonably.
> >> I think your challenge is to have a data model that quickly drops out
> >> data from old items and can bring new items in.
> >>
> >> Is this small enough to do in memory? That's the simple, easy place
> >> to start.
> >>
> >> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
> >> <[email protected]> wrote:
> >>> Hello Folks,
> >>> Item-based recommendation for my dataset is excruciatingly slow on
> >>> an 8-node cluster. Yes, the number of items is big, and the dataset
> >>> churn does not allow for a long asynchronous process. Recommendations
> >>> cannot be stale (a 30-minute delay is stale). I have tried out
> >>> MinHash clustering, and that is scalable, but without a "degree of
> >>> association" with the multiple clusters any user may belong to, it
> >>> seems less tight than a pure item-based (and thus
> >>> similarity-probability) algorithm.
> >>>
> >>> Any ideas how we pull this off, where:
> >>>
> >>> * Item churn is frequent. New items enter the dataset all the time.
> >>> * There is no "preference" apart from opt-in.
> >>> * Anonymous users enter the system very frequently.
> >>>
> >>> Scale is very important.
> >>>
> >>> I am tending towards MinHash with additional algorithms that are
> >>> executed offline, plus co-occurrence.
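For readers following the thread: the RecommenderJob config above uses TanimotoCoefficientSimilarity over binary opt-in data. A minimal sketch of what that measure computes, in Python rather than Mahout's Java (the function and variable names here are illustrative, not Mahout's): each item is the set of users who opted in to it, and the similarity is the Jaccard coefficient of the two sets.

```python
def tanimoto_similarity(users_a: set, users_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two items, each represented
    by the set of users who opted in to it: |A & B| / |A | B|."""
    if not users_a and not users_b:
        return 0.0
    intersection = len(users_a & users_b)
    union = len(users_a) + len(users_b) - intersection
    return intersection / union

# Example: two items with overlapping opt-in audiences.
item_x = {"u1", "u2", "u3", "u4"}
item_y = {"u3", "u4", "u5"}
print(tanimoto_similarity(item_x, item_y))  # 2 shared users / 5 total = 0.4
```

Since opt-ins carry no ratings (multiple opt-ins count as 1), this set-based measure is a natural fit: it rewards audience overlap without needing preference values.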
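Sebastian's --maxPrefsPerUserInItemSimilarity flag downsamples "power" users before the pairwise similarity computation, since per-user cost grows quadratically with the number of preferences. A rough sketch of the idea (not Mahout's actual implementation; the function name and cap are hypothetical):

```python
import random

def downsample_prefs(prefs_by_user: dict, max_prefs_per_user: int, seed: int = 42) -> dict:
    """Cap each user's preference list at max_prefs_per_user by random
    sampling, so power users don't dominate the item-item similarity
    job (a user with p prefs contributes p*(p-1)/2 item pairs)."""
    rng = random.Random(seed)
    sampled = {}
    for user, items in prefs_by_user.items():
        items = list(items)
        if len(items) <= max_prefs_per_user:
            sampled[user] = items
        else:
            sampled[user] = rng.sample(items, max_prefs_per_user)
    return sampled

prefs = {"casual": ["i1", "i2"], "power": [f"i{n}" for n in range(1000)]}
capped = downsample_prefs(prefs, max_prefs_per_user=100)
print(len(capped["casual"]), len(capped["power"]))  # 2 100
```

With 30 million interactions a day, capping the heaviest users trades a small amount of signal for a large reduction in the number of item pairs emitted.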
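The MAHOUT-827 patch Sebastian mentions broadcasts the item-item similarity matrix via the distributed cache and computes recommendations in a map-only job. A toy sketch of the map-side scoring step, assuming the matrix fits in each mapper's memory (plausible here, with only ~256k items); the data structures are simplified stand-ins, not the patch's actual code:

```python
def recommend(user_items: set, similarity: dict, top_n: int = 2) -> list:
    """Map-side scoring: with the item-item similarity matrix held in
    memory, score every unseen item by summing its similarity to the
    items the user already opted in to, then keep the top-N."""
    scores = {}
    for seen in user_items:
        for candidate, sim in similarity.get(seen, {}).items():
            if candidate not in user_items:
                scores[candidate] = scores.get(candidate, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy symmetric similarity matrix (e.g. Tanimoto scores).
sim = {
    "a": {"b": 0.4, "c": 0.1},
    "b": {"a": 0.4, "c": 0.3},
    "c": {"a": 0.1, "b": 0.3},
}
print(recommend({"a"}, sim))  # ['b', 'c']
```

Avoiding the shuffle and reduce phase is exactly what makes this attractive when the post-similarity phase (PartialMultiplyMapper / AggregateAndRecommendReducer) is the bottleneck, as suspected earlier in the thread.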
