Are you referring to my patch, MAHOUT-910? It does let you specify a hard cap: if you place a limit of X, then at most X^2 item-item associations come out. Before, you could not bound the result, since one user could rate a lot of items.
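To make the X^2 bound concrete, here is a rough sketch of the idea, not the actual MAHOUT-910 code. The class and method names (PerUserCapSketch, capItems, cooccurrences) are made up for illustration, and it assumes the cap is applied per user by uniform random sampling. A user with at most X items keeps all of them; a heavier user is sampled down to X, so the pairs emitted for any one user number at most X * X.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration only; none of these names exist in Mahout. */
final class PerUserCapSketch {

  private PerUserCapSketch() {}

  /**
   * Returns at most maxItems of the given item IDs. Users at or under the
   * cap keep every item; larger users are sampled uniformly at random.
   */
  static List<Long> capItems(List<Long> itemIDs, int maxItems, Random random) {
    if (itemIDs.size() <= maxItems) {
      return new ArrayList<>(itemIDs); // few ratings: nothing is sampled out
    }
    List<Long> shuffled = new ArrayList<>(itemIDs);
    Collections.shuffle(shuffled, random);
    return new ArrayList<>(shuffled.subList(0, maxItems));
  }

  /**
   * Emits every ordered item pair (self-pairs included) for one user.
   * For a list of size n this yields exactly n * n pairs, so a capped
   * list of size X can never produce more than X^2 associations.
   */
  static List<long[]> cooccurrences(List<Long> cappedItemIDs) {
    List<long[]> pairs = new ArrayList<>();
    for (long a : cappedItemIDs) {
      for (long b : cappedItemIDs) {
        pairs.add(new long[] {a, b});
      }
    }
    return pairs;
  }
}
```

With a cap of 10, a user who rated 1000 items contributes 100 pairs instead of 1,000,000, while a user who rated 3 items still contributes all 9.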
I think it's slightly more efficient and unbiased as users with few ratings will not have their ratings sampled out, and all users are equally likely to be sampled out. What do you think? Yes you could easily add a secondary cap though as a final filter.

On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[email protected]> wrote:

> Combining the latest commits with my
> optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
> I achieved satisfying results. All the queries were under one second.
>
> Sebastian, I took a look at your patch and I think it's more practical than
> the current SamplingCandidateItemsStrategy, however it still doesn't put a
> strict cap on the number of possible item IDs like my implementation does.
> Perhaps there is room for both implementations?
>
> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[email protected]> wrote:
>
> > I created a jira to supply a non-distributed counterpart of the
> > sampling that is done in the distributed item similarity computation:
> >
> > https://issues.apache.org/jira/browse/MAHOUT-914
> >
> > 2011/12/2 Sean Owen <[email protected]>:
> > > For your purposes, it's LogLikelihoodSimilarity. I made similar changes in
> > > other files. Ideally, just svn update to get all recent changes.
> > >
> > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[email protected]> wrote:
> > >
> > >> Sean, can you tell me which files have you committed the changes to? Thanks
