Hi Daniel,

My view is this: I think you can pretty safely down-sample power users,
as is done in https://issues.apache.org/jira/browse/MAHOUT-914.
I ran some experiments on the movielens1M dataset, which showed that the
error is negligible as long as you look at enough interactions per user:

https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png

I was also able to verify this on the movielens10M dataset. I think this
kind of sampling works because the distribution of interactions with items
among the power users is very similar to that in the whole dataset.
Therefore you don't really learn anything new from the 'power-users'. In
practice, the 'power-users' might also be crawlers or people sharing accounts.
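
A minimal sketch of this kind of per-user down-sampling, for illustration
only. The class name, method name and maxPrefsPerUser parameter are
hypothetical and are not Mahout's API; the actual change is the one attached
to MAHOUT-914. It assumes each user's interactions are available as a list
of item IDs:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PowerUserDownSampler {

  // Hypothetical helper, not part of Mahout: returns at most maxPrefsPerUser
  // item IDs for one user, chosen uniformly at random without replacement.
  public static List<Long> sampleItems(List<Long> itemIDs, int maxPrefsPerUser, Random random) {
    if (itemIDs.size() <= maxPrefsPerUser) {
      // Users with few interactions are left untouched.
      return new ArrayList<>(itemIDs);
    }
    // Shuffle a copy so the caller's list is not modified, then keep a prefix.
    List<Long> shuffled = new ArrayList<>(itemIDs);
    Collections.shuffle(shuffled, random);
    return new ArrayList<>(shuffled.subList(0, maxPrefsPerUser));
  }
}
```

Only users above the cap lose interactions, so the cap mainly affects the
'power-users' while everyone else contributes all of their data.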

However, I am not sure what happens when you also sample the number of
items you look at. If I had to decide, I'd rather follow Ted's advice
and kill super-popular items, as they are not helpful per se.
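
One way to read "killing super-popular items" is to drop every item that
more than some number of users have interacted with. The sketch below shows
that reading only; the class name, method name and maxUsersPerItem threshold
are hypothetical, not something taken from Mahout or from Ted's suggestion:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class PopularItemFilter {

  // Hypothetical helper, not part of Mahout: removes every item that more
  // than maxUsersPerItem distinct users have interacted with.
  public static Map<Long, List<Long>> dropPopularItems(Map<Long, List<Long>> itemsByUser,
                                                       int maxUsersPerItem) {
    // Count the number of distinct users per item.
    Map<Long, Integer> userCounts = new HashMap<>();
    for (List<Long> items : itemsByUser.values()) {
      for (Long itemID : new HashSet<>(items)) {
        userCounts.merge(itemID, 1, Integer::sum);
      }
    }
    // Keep only the items at or below the popularity threshold.
    Map<Long, List<Long>> filtered = new HashMap<>();
    for (Map.Entry<Long, List<Long>> entry : itemsByUser.entrySet()) {
      List<Long> kept = new ArrayList<>();
      for (Long itemID : entry.getValue()) {
        if (userCounts.get(itemID) <= maxUsersPerItem) {
          kept.add(itemID);
        }
      }
      filtered.put(entry.getKey(), kept);
    }
    return filtered;
  }
}
```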

But if the additional item sampling helps in your use case, I don't
oppose including it in Mahout. I think it's good to have a variety of
candidate item strategies. You should, however, do some experimenting to
see how much the sampling affects quality. An A/B test in a real
application would be the best thing to do.

--sebastian



On 04.12.2011 13:12, Daniel Zohar wrote:
> Actually I was referring to Sebastian's. I haven't seen that you committed
> anything to SamplingCandidateItemsStrategy. Can you tell me in which class
> the change appears?
> 
> On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[email protected]> wrote:
> 
>> Are you referring to my patch, MAHOUT-910?
>>
>> It does let you specify a hard cap, really -- if you place a limit of X,
>> then at most X^2 item-item associations come out. Before you could not
>> bound the result, really, since one user could rate a lot of items.
>>
>> I think it's slightly more efficient and unbiased as users with few ratings
>> will not have their ratings sampled out, and all users are equally likely
>> to be sampled out.
>>
>> What do you think?
>> Yes you could easily add a secondary cap though as a final filter.
>>
>> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[email protected]> wrote:
>>
>>> Combining the latest commits with my
>>> optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
>>> I achieved satisfying results. All the queries were under one second.
>>>
>>> Sebastian, I took a look at your patch and I think it's more practical than
>>> the current SamplingCandidateItemsStrategy, however it still doesn't put a
>>> strict cap on the number of possible item IDs like my implementation does.
>>> Perhaps there is room for both implementations?
>>>
>>>
>>>
>>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> I created a jira to supply a non-distributed counterpart of the
>>>> sampling that is done in the distributed item similarity computation:
>>>>
>>>> https://issues.apache.org/jira/browse/MAHOUT-914
>>>>
>>>>
>>>> 2011/12/2 Sean Owen <[email protected]>:
>>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar changes in
>>>>> other files. Ideally, just svn update to get all recent changes.
>>>>>
>>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[email protected]> wrote:
>>>>>
>>>>>> Sean, can you tell me which files you have committed the changes to? Thanks
>>>>
>>>
>>
> 
