>> But, I think this all works much much better if you can only recompute
>> users that have changed their prefs.
In our case a preference is a user clicking on an article ( which doubles as an item ), and these articles are introduced at a frequent rate. Thus the set of items in the dataset churns very frequently, and new items do not necessarily have any history. Of course we need to recommend the latest items. So the issues are:

* We have users with a historical click history.
* We have new items that users will potentially click on.
* Brand new users may have to be checked against users that have a history ( to find similarity ).
* Recommendations on old items, though OK, have staleness associated with them.

Unlike Amazon or Netflix, staleness, churn etc. are a real big deal for us. I realize the overhead of Hadoop, yet the data is cumulative, as in we would rather go for a sliding window of 3 weeks. We do want to recompute every 3-4 hours for every user ( a user may come back at any time ). We do realize that the offline part of the computation will likely be a part of the solution.

What would you do based on our requirements? For me:

* Offline clustering ( PLSI + MinHash ) + item-based recommendation.
* Co-occurrence on items ( for new users who have no history ).

Rough, untested sketches of both ideas are at the bottom, after the quoted thread.

On Tue, Oct 25, 2011 at 10:43 AM, Sean Owen <[email protected]> wrote:
> Why recommend for all users -- why not just new ones or ones that have
> been updated? Yes, you're not intended to list all users into memory
> if using "-u".
>
> A very crude rule of thumb is that you can compute about 100 recs per
> second on a normal machine, normal-sized data (no Hadoop). 8 machines
> would crank through 8.3M recs in 3 hours at best. Hadoop is going to
> be 3-4x slower than this due to its overheads.
>
> This pipeline probably takes 10 minutes or so to finish even with 0
> input; that's the Hadoop overhead. If you're trying to finish
> computations in minutes, Hadoop probably isn't suitable.
>
> But, I think this all works much much better if you can only recompute
> users that have changed their prefs.
>
>
> On Tue, Oct 25, 2011 at 3:27 PM, Vishal Santoshi
> <[email protected]> wrote:
> > The data is big as in for a single day ( and I picked up an arbitrary day )
> >
> > 8,335,013 users.
> > 256,010 distinct Items.
> >
> > I am using the Item Based Recommender ( The RecommenderJob ) , with no
> > Preference ( opt in is a signal of preference , multiple opt ins are
> > considered 1 )
> >
> > <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> > <arg>recommender</arg>
> > <arg>--input</arg>
> > <arg>${out}/items/bag</arg>
> > <arg>--output</arg>
> > <arg>${out}/items_similarity</arg>
> > <arg>-u</arg>
> > <arg>${out}/items/users/part-r-00000</arg>
> > <arg>-b</arg>
> > <arg>-n</arg>
> > <arg>2</arg>
> > <arg>--similarityClassname</arg>
> >
> > <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
> > <arg>--tempDir</arg>
> > <arg>${out}/temp</arg>
> >
> > Of course the Recommendations are for every user and thus
> > the RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer is the
> > most expensive of all.
> > Further , not sure why the user file is taken in as a Distributed File
> > especially when it may actually be a bigger file that a typical TaskTracker
> > JVM memory limit.
> >
> >
> > In case of MinHash , MinHashDriver
> >
> > <java>
> > <job-tracker>${jobTracker}</job-tracker>
> > <name-node>${nameNode}</name-node>
> > <prepare>
> > <delete path="${out}/minhash"/>
> > </prepare>
> > <configuration>
> > <property>
> > <name>mapred.job.queue.name</name>
> > <value>${queueName}</value>
> > </property>
> > </configuration>
> > <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
> > <arg>minhash_local</arg>
> > <arg>--input</arg>
> > <arg>${out}/bag</arg>
> > <arg>--output</arg>
> > <arg>${out}/minhash</arg>
> > <arg>--keyGroups</arg> <!-- Key Groups -->
> > <arg>2</arg>
> > <arg>-r</arg> <!-- Number of Reducers -->
> > <arg>40</arg>
> > <arg>--minClusterSize</arg> <!-- A legitimate cluster must have
> > this number of members -->
> > <arg>5</arg>
> > <arg>--hashType</arg> <!-- murmur and linear are the other 2
> > options -->
> > <arg>polynomial</arg>
> > </java>
> >
> > This of course scales. I still have to work with the clusters created and a
> > fair amount of work has to be done to figure out which cluster is relevant.
> >
> >
> > A week of data in this case created the MinHash on our cluster in about 20
> > minutes.
> >
> >
> > Regards.
> >
> >
> > On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <[email protected]> wrote:
> >
> >> Can you put any more numbers around this? how slow is slow, how big is big?
> >> What part of Mahout are you using -- or are you using Mahout?
> >>
> >> Item-based recommendation sounds fine. Anonymous users aren't a
> >> problem as long as you can distinguish them reasonably.
> >> I think your challenge is to have a data model that quickly drops out
> >> data from old items and can bring new items in.
> >>
> >> Is this small enough to do in memory? that's the simple, easy place to
> >> start.
> >>
> >> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
> >> <[email protected]> wrote:
> >> > Hello Folks,
> >> > The Item Based Recommendations for my dataset is
> >> > excruciatingly slow on a 8 node cluster. Yes the number of items is big and
> >> > the dataset churn does not allow for a long asynchronous process.
> >> > Recommendations cannot be stale ( a 30 minute delay is stale ). I have tried
> >> > out MinHash clustering and that is scalable, but without a "degree of
> >> > association" with multiple clusters any user may belong to , it seems less
> >> > tight that pure item based ( and thus similarity probability ) algorithm.
> >> >
> >> > Any ideas how we pull this off., where
> >> >
> >> > * The item churn is frequent. New items enter the dataset all the time.
> >> > * There is no "preference" apart from opt in.
> >> > * Very frequent anonymous users enter the system almost all the time.
> >> >
> >> >
> >> > Scale is very important.
> >> >
> >> > I am tending towards MinHash with additional algorithms that are executed
> >> > offline and co occurance.
> >> >
> >> >
> >
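
Sketch 1 -- to make the "in memory, only recompute changed users" idea concrete, this is roughly what we would try for users that do have history. It is an untested sketch over the 3-week window using the Taste classes; the CSV path and the changedUserIds array are placeholders for whatever our sliding-window extract produces.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class IncrementalItemRecs {
      public static void main(String[] args) throws Exception {
        // Untested sketch: "clicks-3weeks.csv" is a placeholder file of
        // "userID,itemID" lines (opt-in only, so boolean prefs, no values).
        DataModel model = new FileDataModel(new File("clicks-3weeks.csv"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        ItemBasedRecommender recommender =
            new GenericBooleanPrefItemBasedRecommender(model, similarity);

        // Placeholder: only the users whose prefs changed since the last run.
        long[] changedUserIds = {12345L, 67890L};
        for (long userId : changedUserIds) {
          List<RecommendedItem> recs = recommender.recommend(userId, 2);
          System.out.println(userId + " -> " + recs);
        }
      }
    }

This assumes the 3-week window actually fits in memory, per Sean's earlier question; if it does not, scoping the run to changed users should still cut the Hadoop job down considerably.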
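
Sketch 2 -- for the brand new / anonymous users, the co-occurrence idea boils down to something like this ( again rough and untested ): the counts are built offline from the same click window, and only the cheap lookup happens at request time for whatever the session has clicked so far.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class CooccurrenceRecs {

      // itemA -> ( itemB -> number of users who clicked both ); built offline.
      private final Map<Long, Map<Long, Integer>> cooccurrence =
          new HashMap<Long, Map<Long, Integer>>();

      // Offline pass: feed each user's distinct clicked items once.
      public void addUserClicks(Set<Long> items) {
        for (Long a : items) {
          Map<Long, Integer> row = cooccurrence.get(a);
          if (row == null) {
            row = new HashMap<Long, Integer>();
            cooccurrence.put(a, row);
          }
          for (Long b : items) {
            if (a.equals(b)) continue;
            Integer count = row.get(b);
            row.put(b, count == null ? 1 : count + 1);
          }
        }
      }

      // Online lookup: score candidate items by summed co-occurrence
      // with the clicks seen in this (possibly anonymous) session.
      public List<Long> recommendForSession(Set<Long> sessionClicks, int howMany) {
        final Map<Long, Integer> scores = new HashMap<Long, Integer>();
        for (Long clicked : sessionClicks) {
          Map<Long, Integer> row = cooccurrence.get(clicked);
          if (row == null) continue;
          for (Map.Entry<Long, Integer> e : row.entrySet()) {
            if (sessionClicks.contains(e.getKey())) continue;
            Integer s = scores.get(e.getKey());
            scores.put(e.getKey(), s == null ? e.getValue() : s + e.getValue());
          }
        }
        List<Long> ranked = new ArrayList<Long>(scores.keySet());
        Collections.sort(ranked, new Comparator<Long>() {
          public int compare(Long x, Long y) {
            return scores.get(y) - scores.get(x);
          }
        });
        return ranked.subList(0, Math.min(howMany, ranked.size()));
      }
    }

The offline pass is the part that could move to Hadoop; the lookup side stays cheap, and a new item starts getting recommended as soon as it co-occurs with anything else in the window, which is the property we care about.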
