Hi Daniel,

My view is this: I think you can pretty safely down-sample power users like it is done in https://issues.apache.org/jira/browse/MAHOUT-914. I did some experiments on the movielens1M dataset that showed that you get a negligible error, given that you look at enough interactions per user:
https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png

I could also verify this on the movielens10M dataset. I think this kind of sampling works because the distribution of interactions with items among the power users and in the whole dataset is very similar; therefore you don't really learn anything new from the power users. In practice, the power users might also be crawlers or people sharing accounts.

However, I am not sure what happens when you also sample the number of items you look at. If I had to decide, I'd rather follow Ted's advice and kill super-popular items, as they are not helpful per se. But if the additional item sampling helps in your use case, I don't oppose including it in Mahout. I think it's good to have a variety of candidate item strategies. You should, however, do some experimenting to see how much the sampling affects quality. An A/B test in a real application would be the best thing to do.

--sebastian

On 04.12.2011 13:12, Daniel Zohar wrote:
> Actually I was referring to Sebastian's. I haven't seen you commit
> anything to SamplingCandidateItemsStrategy. Can you tell me in which class
> the change appears?
>
> On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[email protected]> wrote:
>
>> Are you referring to my patch, MAHOUT-910?
>>
>> It does let you specify a hard cap, really -- if you place a limit of X,
>> then at most X^2 item-item associations come out. Before, you could not
>> bound the result, since one user could rate a lot of items.
>>
>> I think it's slightly more efficient and unbiased, as users with few
>> ratings will not have their ratings sampled out, and all users are
>> equally likely to be sampled out.
>>
>> What do you think?
>> Yes, you could easily add a secondary cap as a final filter.
>>
>> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[email protected]> wrote:
>>
>>> Combining the latest commits with my optimized
>>> SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1),
>>> I achieved satisfying results. All the queries were under one second.
>>>
>>> Sebastian, I took a look at your patch and I think it's more practical
>>> than the current SamplingCandidateItemsStrategy; however, it still
>>> doesn't put a strict cap on the number of possible item IDs like my
>>> implementation does. Perhaps there is room for both implementations?
>>>
>>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[email protected]>
>>> wrote:
>>>
>>>> I created a JIRA to supply a non-distributed counterpart of the
>>>> sampling that is done in the distributed item similarity computation:
>>>>
>>>> https://issues.apache.org/jira/browse/MAHOUT-914
>>>>
>>>> 2011/12/2 Sean Owen <[email protected]>:
>>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar
>>>>> changes in other files. Ideally, just svn update to get all recent
>>>>> changes.
>>>>>
>>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Sean, can you tell me which files you committed the changes to?
>>>>>> Thanks
