Hello Vishal,

How many interactions do you have between those users and items? I'd
definitely recommend you try out the current trunk of Mahout, as the
performance of RecommenderJob has been significantly improved.

The most important parameter (performance-wise) is the newly introduced
--maxPrefsPerUserInItemSimilarity, which causes RecommenderJob to
downsample "power" users who can slow down the recommendation
computation (without contributing much to the quality of the results).
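To illustrate what that downsampling does, here is a toy Python sketch of the idea (this is just an illustration, not Mahout's actual implementation, and the cap of 500 is an arbitrary example value):

```python
import random

def downsample_prefs(user_prefs, max_prefs, seed=42):
    """Cap each user's preference list at max_prefs by random sampling,
    so "power" users don't blow up the item-similarity computation."""
    rng = random.Random(seed)
    sampled = {}
    for user, items in user_prefs.items():
        if len(items) <= max_prefs:
            sampled[user] = list(items)
        else:
            sampled[user] = rng.sample(list(items), max_prefs)
    return sampled

# A power user with 1000 opt-ins is cut down to 500 sampled ones,
# while normal users are left untouched.
prefs = {"power_user": list(range(1000)), "casual_user": [1, 2, 3]}
capped = downsample_prefs(prefs, max_prefs=500)
```

The point is that the cost of the pairwise item-similarity phase grows with the square of each user's preference count, so capping the few heaviest users removes most of the work while barely changing the result.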

I'm currently running tests with a patched version of the new
RecommenderJob on a Yahoo Music dataset consisting of more than 700
million ratings from 2 million users on 140 thousand items (which seems
similar to your user/item ratio) and am seeing nice results with that,
even though I'm running it on a small research cluster.

If the phase after the item similarity computation takes too long (I
think you suspected this), then you can also try the patch from
https://issues.apache.org/jira/browse/MAHOUT-827, which broadcasts the
similarity matrix via the distributed cache and computes the
recommendations in a map-only job. This could work well for your use
case, as you have a relatively small number of items.
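The idea behind that patch can be sketched as follows (a toy Python illustration of map-side recommendation with an in-memory similarity matrix, not the actual MAHOUT-827 code):

```python
def recommend(user_items, similarities, top_n=2):
    """Score each unseen item as the sum of its similarities to the
    user's items. Because the item-item similarity matrix is small, it
    can be held in memory by every mapper (shipped via the distributed
    cache), so each map call can emit recommendations directly and no
    reduce phase is needed."""
    seen = set(user_items)
    scores = {}
    for item in seen:
        for other, sim in similarities.get(item, {}).items():
            if other not in seen:
                scores[other] = scores.get(other, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Tiny item-item similarity matrix (symmetric entries listed per item).
sims = {
    "a": {"b": 0.9, "c": 0.2},
    "b": {"a": 0.9, "c": 0.5},
    "c": {"a": 0.2, "b": 0.5},
}
recs = recommend(["a"], sims)  # -> ["b", "c"]
```

With ~256k items the matrix (pruned to the top similar items per item) stays far smaller than the user-preference data, which is what makes the broadcast approach attractive here.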

--sebastian



On 25.10.2011 16:27, Vishal Santoshi wrote:
> The data is big; for a single day (and I picked an arbitrary day):
> 
> 8,335,013 users.
> 256,010   distinct items.
> 
> I am using the item-based recommender (the RecommenderJob), with no
> preference values (an opt-in is the signal of preference; multiple
> opt-ins are counted as 1):
> 
>             <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
>             <arg>recommender</arg>
>             <arg>--input</arg>
>             <arg>${out}/items/bag</arg>
>             <arg>--output</arg>
>             <arg>${out}/items_similarity</arg>
>             <arg>-u</arg>
>             <arg>${out}/items/users/part-r-00000</arg>
>             <arg>-b</arg>
>             <arg>-n</arg>
>             <arg>2</arg>
>             <arg>--similarityClassname</arg>
>             <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
>             <arg>--tempDir</arg>
>             <arg>${out}/temp</arg>
> 
> Of course the recommendations are for every user, and thus the
> RecommenderJob PartialMultiplyMapper/AggregateAndRecommendReducer phase
> is the most expensive of all.
> Further, I'm not sure why the user file is taken in as a distributed
> cache file, especially when it may actually be bigger than a typical
> TaskTracker JVM's memory limit.
> 
> 
> 
> In the case of MinHash, the MinHashDriver:
> 
>        <java>
>             <job-tracker>${jobTracker}</job-tracker>
>             <name-node>${nameNode}</name-node>
>              <prepare>
>                 <delete path="${out}/minhash"/>
>             </prepare>
>             <configuration>
>                 <property>
>                     <name>mapred.job.queue.name</name>
>                     <value>${queueName}</value>
>                 </property>
>             </configuration>
>             <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
>             <arg>minhash_local</arg>
>             <arg>--input</arg>
>             <arg>${out}/bag</arg>
>             <arg>--output</arg>
>             <arg>${out}/minhash</arg>
>             <arg>--keyGroups</arg>  <!-- Key Groups -->
>             <arg>2</arg>
>             <arg>-r</arg>  <!-- Number of Reducers -->
>             <arg>40</arg>
>             <arg>--minClusterSize</arg>  <!-- A legitimate cluster must have this number of members -->
>             <arg>5</arg>
>             <arg>--hashType</arg>  <!-- murmur and linear are the other 2 options -->
>             <arg>polynomial</arg>
>         </java>
> 
> This of course scales. I still have to work with the clusters created,
> and a fair amount of work remains to figure out which clusters are
> relevant.
> 
> 
> A week of data in this case created the MinHash clusters on our cluster
> in about 20 minutes.
> 
> 
> Regards.
> 
> 
> On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
> 
>> Can you put any more numbers around this? How slow is slow, how big is big?
>> What part of Mahout are you using -- or are you using Mahout?
>>
>> Item-based recommendation sounds fine. Anonymous users aren't a
>> problem as long as you can distinguish them reasonably.
>> I think your challenge is to have a data model that quickly drops out
>> data from old items and can bring new items in.
>>
>> Is this small enough to do in memory? That's the simple, easy place to
>> start.
>>
>> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
>> <[email protected]> wrote:
>>> Hello Folks,
>>>                  The item-based recommendations for my dataset are
>>> excruciatingly slow on an 8-node cluster. Yes, the number of items is
>>> big, and the dataset churn does not allow for a long asynchronous
>>> process. Recommendations cannot be stale (a 30-minute delay is stale).
>>> I have tried out MinHash clustering and that is scalable, but without
>>> a "degree of association" with the multiple clusters any user may
>>> belong to, it seems less tight than the pure item-based (and thus
>>> similarity probability) algorithm.
>>>
>>> Any ideas how we pull this off, where
>>>
>>> * The item churn is frequent. New items enter the dataset all the time.
>>> * There is no "preference" apart from opt in.
>>> * New anonymous users enter the system very frequently.
>>>
>>>
>>> Scale is very important.
>>>
>>> I am tending towards MinHash, with additional algorithms executed
>>> offline, plus co-occurrence.
>>>
>>
> 
