Yep, please keep me posted. BTW, this is exactly why MinHash piqued my
curiosity, and that seems to be affirmed by
http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce

MinHash scales, so the offline periodic component (based on
Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver) seems
promising.

Again, please keep the forum posted on how you go about doing this.

Regards,

Vishal.

On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen wrote:

> Oh I see, right.
>
> Well, one general strategy is to use Hadoop to compute the
> recommendations regularly, but not nearly in real-time. Then, use the
> latest data to imperfectly update the recommendations in real-time.
> So, you always have slightly stale recommendations, and item-item
> similarities to fall back on, and are reloading those periodically.
> Then you're trying to update any recently changed item or user in
> real-time using item-based recommendation, which can be fast.
>
> It's a really big topic in its own right, and there's no complete
> answer for you here, but you can piece this together from Mahout
> rather than build it from scratch.
>
> (This is more or less exactly what I have been working on separately,
> a hybrid Hadoop-based / real-time recommender that can handle this
> scale but also respond reasonably to new data.)
>
> On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi wrote:
> > They are all active in a day. I am talking about 8.3 million active
> > users a day.
> > A significant fraction of them will be new users (say about 2-3
> > million of them).
> > Further, the churn on items is likely to make historical
> > recommendations obsolete.
> > Thus if I have recommendations that were good for user A yesterday,
> > they are likely to carry far less weight today.
> >
> > On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen wrote:
> >
> >> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi wrote:
> >> > In our case a preference is a user clicking on an article (which
> >> > doubles as an item).
> >> > These articles are introduced at a frequent rate, so new items
> >> > churn through the dataset very often and do not necessarily
> >> > have any history.
> >> > Of course we need to recommend the latest items.
> >>
> >> OK, but I'm still not seeing why all users need an update every time.
> >> Surely most of the 8.3M users aren't even active in a given day.
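
P.S. For anyone following the thread, the strategy Sean describes (serve slightly stale batch output, then patch it with fresh clicks using item-based recommendation) could be sketched roughly as below. This is only a minimal illustration in plain Python, not Mahout code. The function name `recommend`, the `ITEM_SIMILARITIES` table, and the article ids are all made up for the example, and the table merely stands in for whatever the periodic Hadoop job would actually produce.

```python
from collections import defaultdict


def recommend(recent_clicks, item_similarities, top_n=3):
    """Score unseen items by their summed similarity to the user's recent clicks.

    recent_clicks: item ids the user clicked since the last batch run.
    item_similarities: {item: {other_item: similarity}}, assumed to be the
        (slightly stale) output of the periodic batch job.
    """
    seen = set(recent_clicks)
    scores = defaultdict(float)
    for clicked in recent_clicks:
        # Items missing from the batch table simply contribute nothing.
        for other, sim in item_similarities.get(clicked, {}).items():
            if other not in seen:
                scores[other] += sim
    # Highest score first; ties broken by item id so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical item-item similarities, reloaded after each batch run.
ITEM_SIMILARITIES = {
    "a1": {"a2": 0.9, "a3": 0.4},
    "a2": {"a1": 0.9, "a4": 0.7},
    "a3": {"a1": 0.4, "a4": 0.2},
}

print(recommend(["a1", "a2"], ITEM_SIMILARITIES))  # ['a4', 'a3']
```

The lookup touches only the items a user just clicked, which is why this step can be fast. A brand-new article that is missing from the batch table contributes no score until the next reload, so this sketch does not by itself address the new-item churn discussed above.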
