Exactly as you said. And as you may have guessed, the domain I am working in is very similar to Google's. MinHash (and thus Jaccard similarity) does scale, since it reduces the user-cluster computation to each user's own data, but it has a different set of issues, as do PLSI and the co-occurrence tracking (which pushes us towards a NoSQL store such as Cassandra, MongoDB or HBase). For me, item-based recommendation is fairly precise with little or no complexity (apart from the scale issue) and thus pretty straightforward.
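To make the MinHash/Jaccard point concrete, here is a minimal sketch in Python of how a MinHash signature approximates the Jaccard similarity of two users' click sets. It is only an illustration under my own assumptions: the function names, the SHA-1 based hash family and the sample article ids are all hypothetical, and this is neither Mahout's MinHash driver nor the implementation from the Google paper.

```python
import hashlib


def _hash(seed, item):
    # Hypothetical hash family: SHA-1 over "seed:item", truncated to 64 bits.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    # One minimum per hash function; depends only on this user's own items.
    items = set(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree.
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    # |A intersect B| / |A union B|, for comparison with the estimate.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up article ids clicked by two users.
    user_a = {"article-1", "article-2", "article-3"}
    user_b = {"article-2", "article-3", "article-4"}
    sig_a = minhash_signature(user_a, num_hashes=256)
    sig_b = minhash_signature(user_b, num_hashes=256)
    print("exact    :", exact_jaccard(user_a, user_b))
    print("estimated:", estimated_jaccard(sig_a, sig_b))
```

Each signature is computed from a single user's data, which is why this part parallelizes well; users whose signatures agree in many positions are candidates for the same cluster.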
As Sean predicted, the problem (which both we and Google face) is not really tailor-made for item-based recommendation. A hybrid has to be found, IMHO.

On Tue, Oct 25, 2011 at 12:16 PM, Sebastian Schelter <[email protected]> wrote:

> The Google News paper you cite follows an approach very different from
> the one implemented in RecommenderJob.
>
> Their approach has a very high complexity and they chose to use it
> because of the extreme item churn in the news domain.
>
> The techniques in the Google paper (MinHash and PLSI) are used compute
> user similarities (clusters of users, MinHash just looks at the ratio of
> co-read stories, PLSI tries to cluster the users according to some
> latent features in their interactions). A third component tracks co-read
> stories in realtime and a user is recommended stories that are co-read
> from other users in his clusters.
>
> --sebastian
>
> On 25.10.2011 18:07, Vishal Santoshi wrote:
> > Yep, Please keep me posted.
> > BTW , this is exactly why MinHash picked my curiosity and that seems to be
> > affirmed by
> >
> > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> >
> > MinHash scales , such that the offline periodic component ( based on
> > hadoop/mahout yes mahout has a Minhash based clustering Driver ) seems
> > promising.
> > Again please keep the forum posted on how you go about doing this.
> >
> > Regards,
> >
> > Vishal.
> >
> > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <[email protected]> wrote:
> >
> >> Oh I see, right.
> >>
> >> Well, one general strategy is to use Hadoop to compute the
> >> recommendations regularly, but not nearly in real-time. Then, use the
> >> latest data to imperfectly update the recommendations in real-time.
> >> So, you always have slightly stale recommendations, and item-item
> >> similarities to fall back on, and are reloading those periodically.
> >> Then you're trying to update any recently changed item or user in
> >> real-time using item-based recommendation, which can be fast.
> >>
> >> It's a really big topic in its own right, and there's no complete
> >> answer for you here, but you can piece this together from Mahout
> >> rather than build it from scratch.)
> >>
> >> (This is more or less exactly what I have been working on separately,
> >> a hybrid Hadoop-based / real-time recommender that can handle this
> >> scale but also respond reasonably to new data.)
> >>
> >> On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi <[email protected]> wrote:
> >>> They are all active in a day. I am talking about 8.3 million active users a
> >>> day.
> >>> A significant fraction of them will be new users ( say about 2-3 million of
> >>> them ).
> >>> Further the churn on items is likely to make historical recommendations
> >>> obsolete.
> >>> Thus if I have recommendations that were good of user A yesterday, they are
> >>> likely to be far less a weight as of today.
> >>>
> >>> On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen <[email protected]> wrote:
> >>>
> >>>> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi <[email protected]> wrote:
> >>>>> In our case the preferences is a user clicking on an article ( which
> >>>>> doubles as an item ).
> >>>>> And these articles are introduced at a frequent rate. Thus the number of new
> >>>>> items that
> >>>>> occur in the dataset has a very frequent churn and thus not necessarily
> >>>>> having any history.
> >>>>> Of course we need to recommend the latest item.
> >>>>
> >>>> OK, but I'm still not seeing why all users need an update every time.
> >>>> Surely most of the 8.3M users aren't even active in a given day.
