Yep, please keep me posted.
BTW, this is exactly why MinHash piqued my curiosity, and that seems to be
affirmed by

http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce

MinHash scales, so the offline periodic component (based on
Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver) seems
promising.
Again, please keep the forum posted on how you go about doing this.
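
For anyone following along, here is a rough sketch of the MinHash idea as
I understand it, assuming each user is represented by the set of articles
they clicked. This is NOT Mahout's MinHash driver; the function names, the
click data and the clustering rule are all made up for illustration.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of (seed, item); stands in for one random permutation."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """Signature of a non-empty item set: the minimum hash value under each seeded hash."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def cluster_key(signature, keys_per_cluster=2):
    """Hypothetical clustering rule: users whose first few minhash values agree share a cluster."""
    return tuple(signature[:keys_per_cluster])


# Hypothetical click data: user -> set of clicked article ids.
clicks = {
    "userA": {"a1", "a2", "a3", "a4"},
    "userB": {"a1", "a2", "a3", "a4"},
    "userC": {"a9", "a10"},
}
signatures = {user: minhash_signature(items) for user, items in clicks.items()}
```

Each signature has a fixed length no matter how many articles a user
clicked, and each user is hashed independently of every other user, which
is why I would expect this to fit a periodic offline MapReduce job.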

Regards,

Vishal.

On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <[email protected]> wrote:

> Oh I see, right.
>
> Well, one general strategy is to use Hadoop to compute the
> recommendations regularly, but not nearly in real-time. Then, use the
> latest data to imperfectly update the recommendations in real-time.
> So, you always have slightly stale recommendations, and item-item
> similarities to fall back on, and are reloading those periodically.
> Then you're trying to update any recently changed item or user in
> real-time using item-based recommendation, which can be fast.
>
> It's a really big topic in its own right, and there's no complete
> answer for you here, but you can piece this together from Mahout
> rather than build it from scratch.
>
> (This is more or less exactly what I have been working on separately,
> a hybrid Hadoop-based / real-time recommender that can handle this
> scale but also respond reasonably to new data.)
>
> On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi
> <[email protected]> wrote:
> > They are all active in a day. I am talking about 8.3 million active
> > users a day. A significant fraction of them will be new users (say
> > about 2-3 million of them).
> > Further, the churn on items is likely to make historical
> > recommendations obsolete. Thus if I have recommendations that were good
> > for user A yesterday, they are likely to carry far less weight today.
> >
> > On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen <[email protected]> wrote:
> >
> >> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi
> >> <[email protected]> wrote:
> >> > In our case a preference is a user clicking on an article (which
> >> > doubles as an item). These articles are introduced at a frequent
> >> > rate, so the new items in the dataset churn very frequently and do
> >> > not necessarily have any history.
> >> > Of course we need to recommend the latest items.
> >>
> >> OK, but I'm still not seeing why all users need an update every time.
> >> Surely most of the 8.3M users aren't even active in a given day.
> >>
> >
>
