The Google News paper you cite follows an approach very different from
the one implemented in RecommenderJob.

Their approach is quite complex; they chose it because of the extreme
item churn in the news domain.

The techniques in the Google paper (MinHash and PLSI) are used to
compute user similarities, i.e. clusters of users: MinHash just looks
at the ratio of co-read stories, while PLSI tries to cluster the users
according to latent features in their interactions. A third component
tracks co-read stories in real time, and a user is recommended stories
that were co-read by other users in his clusters.
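
To make the MinHash part concrete, here is a tiny standalone sketch of
the idea (my own illustration, not the paper's implementation; the
class, the hash function and all parameters are made up). The more
signature entries of two users agree, the higher the ratio of co-read
stories, so keying users on a few signature entries groups similar
users into the same clusters:

import java.util.Random;

public class MinHashSketch {

  private final int[] seedsA;
  private final int[] seedsB;

  public MinHashSketch(int numHashFunctions, long randomSeed) {
    Random random = new Random(randomSeed);
    seedsA = new int[numHashFunctions];
    seedsB = new int[numHashFunctions];
    for (int i = 0; i < numHashFunctions; i++) {
      seedsA[i] = random.nextInt();
      seedsB[i] = random.nextInt();
    }
  }

  // One entry per hash function: the minimum hash value over all
  // stories the user has read.
  public int[] signature(long[] readStoryIDs) {
    int[] sig = new int[seedsA.length];
    for (int i = 0; i < sig.length; i++) {
      int min = Integer.MAX_VALUE;
      for (long storyID : readStoryIDs) {
        int hash = (int) (seedsA[i] * storyID + seedsB[i]);
        if (hash < min) {
          min = hash;
        }
      }
      sig[i] = min;
    }
    return sig;
  }

  // The fraction of agreeing entries estimates the Jaccard overlap
  // (ratio of co-read stories) of the two users.
  public static double estimatedOverlap(int[] sig1, int[] sig2) {
    int matches = 0;
    for (int i = 0; i < sig1.length; i++) {
      if (sig1[i] == sig2[i]) {
        matches++;
      }
    }
    return (double) matches / sig1.length;
  }

  public static void main(String[] args) {
    MinHashSketch minHash = new MinHashSketch(100, 42L);
    long[] userA = {1, 2, 3, 4, 5};
    long[] userB = {3, 4, 5, 6, 7};
    System.out.println(estimatedOverlap(
        minHash.signature(userA), minHash.signature(userB)));
  }
}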

--sebastian

On 25.10.2011 18:07, Vishal Santoshi wrote:
> Yep, please keep me posted.
> BTW, this is exactly why MinHash piqued my curiosity, and that seems
> to be affirmed by
> 
> http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> 
> MinHash scales, such that the offline periodic component (based on
> Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver)
> seems promising.
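> 
> (For reference, I believe the driver is invoked roughly as
> 
>   bin/mahout minhash -i <input vectors> -o <output clusters>
> 
> though I haven't checked the exact flags; bin/mahout minhash --help
> lists them.)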
> Again please keep the forum posted on how you go about doing this.
> 
> Regards,
> 
> Vishal.
> 
> On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <[email protected]> wrote:
> 
>> Oh I see, right.
>>
>> Well, one general strategy is to use Hadoop to compute the
>> recommendations regularly, but not nearly in real time. Then, use the
>> latest data to imperfectly update the recommendations in real time.
>> So, you always have slightly stale recommendations, and item-item
>> similarities to fall back on, and are reloading those periodically.
>> Then you're trying to update any recently changed item or user in
>> real time using item-based recommendation, which can be fast.
>>
>> It's a really big topic in its own right, and there's no complete
>> answer for you here, but you can piece this together from Mahout
>> rather than build it from scratch.
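>>
>> As a rough sketch of that pattern with current Mahout (the file paths
>> and user ID are placeholders; FileItemSimilarity would hold the
>> item-item similarities precomputed on Hadoop):
>>
>> import java.io.File;
>> import java.util.List;
>> import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>> import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
>> import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
>> import org.apache.mahout.cf.taste.model.DataModel;
>> import org.apache.mahout.cf.taste.recommender.RecommendedItem;
>> import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
>>
>> public class HybridServingSketch {
>>   public static void main(String[] args) throws Exception {
>>     // Preferences and similarities written by the periodic Hadoop job.
>>     DataModel model = new FileDataModel(new File("/data/prefs.csv"));
>>     ItemSimilarity similarity =
>>         new FileItemSimilarity(new File("/data/item-similarities.csv"));
>>     GenericItemBasedRecommender recommender =
>>         new GenericItemBasedRecommender(model, similarity);
>>
>>     // Real-time request, answered from slightly stale similarities.
>>     List<RecommendedItem> recs = recommender.recommend(123L, 10);
>>     System.out.println(recs);
>>
>>     // After the next Hadoop run replaces the files, trigger a reload.
>>     recommender.refresh(null);
>>   }
>> }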
>>
>> (This is more or less exactly what I have been working on separately,
>> a hybrid Hadoop-based / real-time recommender that can handle this
>> scale but also respond reasonably to new data.)
>>
>> On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi
>> <[email protected]> wrote:
>>> They are all active in a day. I am talking about 8.3 million active
>>> users a day.
>>> A significant fraction of them will be new users (say about 2-3
>>> million of them).
>>> Further, the churn on items is likely to make historical
>>> recommendations obsolete.
>>> Thus if I have recommendations that were good for user A yesterday,
>>> they are likely to carry far less weight as of today.
>>>
>>> On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen <[email protected]> wrote:
>>>
>>>> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi
>>>> <[email protected]> wrote:
>>>>> In our case, a preference is a user clicking on an article (which
>>>>> doubles as an item).
>>>>> And these articles are introduced at a frequent rate. Thus the set
>>>>> of new items in the dataset churns very frequently, and those items
>>>>> do not necessarily have any history.
>>>>> Of course we need to recommend the latest items.
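>>>>>
>>>>> Since clicks carry no rating value, I assume they map to Mahout's
>>>>> boolean-preference model, something like this minimal sketch (the
>>>>> IDs are made up):
>>>>>
>>>>> import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
>>>>> import org.apache.mahout.cf.taste.impl.common.FastIDSet;
>>>>> import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
>>>>> import org.apache.mahout.cf.taste.model.DataModel;
>>>>>
>>>>> public class ClicksAsPrefs {
>>>>>   public static void main(String[] args) throws Exception {
>>>>>     // user -> set of clicked articles, no preference values
>>>>>     FastByIDMap<FastIDSet> clicks = new FastByIDMap<FastIDSet>();
>>>>>     FastIDSet articles = new FastIDSet();
>>>>>     articles.add(101L); // user 1 clicked article 101
>>>>>     articles.add(102L);
>>>>>     clicks.put(1L, articles);
>>>>>     DataModel model = new GenericBooleanPrefDataModel(clicks);
>>>>>     System.out.println(model.getNumUsers() + " users");
>>>>>   }
>>>>> }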
>>>>
>>>> OK, but I'm still not seeing why all users need an update every time.
>>>> Surely most of the 8.3M users aren't even active in a given day.
>>>>
>>>
>>
> 
