On Fri, Dec 2, 2011 at 6:42 PM, Sean Owen <[email protected]> wrote:

> I'll commit #2 since that's an obvious win.

Great.
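Just so we're describing the same thing, here is roughly how I picture #2
inside getNumUsersWithPreferenceFor(). This is a sketch only -- the helper and
its names are illustrative, not what necessarily lands in trunk -- and it
assumes intersectionSize() walks its argument while doing hash lookups in the
receiver, which is what the larger.intersectionSize(smaller) form relies on:

import org.apache.mahout.cf.taste.impl.common.FastIDSet;

// Sketch only: illustrative helper, not the actual trunk code.
final class IntersectionSketch {

  static int numUsersWithPreferenceForBoth(FastIDSet users1, FastIDSet users2) {
    if (users1 == null || users2 == null) {
      return 0;
    }
    // Always probe the smaller set against the larger one: the number of hash
    // lookups is bounded by the smaller set's size (1 lookup for a one-user
    // item) instead of the larger set's size (100k lookups for a popular item).
    return users1.size() > users2.size()
        ? users1.intersectionSize(users2)
        : users2.intersectionSize(users1);
  }

  private IntersectionSketch() { }
}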
> I'll make a patch with a proposal for #1, which is sort of like what you're
> doing, just slightly different -- it gives a much stronger cap on the number
> of interactions considered.

Alright.

> For #3: You can't just store counts -- you need the user IDs to compute
> intersection sizes.

I'm suggesting adding an additional Map<Long, Long> under
GenericBooleanPrefDataModel (rough sketch at the end of this mail). There will
still be GenericBooleanPrefDataModel.preferenceFromUsers and
GenericBooleanPrefDataModel.preferenceForItems; it's just that the latter
might be a lot smaller in cases like mine.

> If I'm right about the reason #3 makes it run faster, then #2 should make it
> run faster by about the same amount -- so #2 subsumes #3. And if that's
> true, I'd strongly prefer to not do #3 as it would add little and make
> things incorrect.
>
> I don't think that the reason #3 helps is because it makes fewer items
> available as candidates, because it shouldn't. Say user A rates items X, Y
> and Z. Say user B has rated only Y. When recommending for A, we'll find the
> set of users who have rated Y (A, B and others). And when we look at B,
> we'll add Y as a candidate item (the only item B rated). But Y will be
> removed at the end since A already rated it. So, B didn't add any more
> item-item interactions to consider. If you remove user B, you still have the
> same work to do.
>
> I think #3 "helps" because it sets the count of users per item, for some
> items, to 0 instead of 1, which avoids an intersection computation. But
> that's later, after you have gotten the list of candidates. It's not
> reducing the number of candidates, but ignoring a lot of them later by
> assuming their similarity is 0 (when it isn't!)
>
> But that intersection computation doesn't have to be so slow. It's slow to
> check whether each of 1000 items in one set is a member of another set of 1
> item. But checking 1 item for membership in another set of 1000 is just one
> quick hash lookup.
>
> Now, I agree that pretending that set has 0 items instead of 1 is even a
> little faster still! And perhaps you could argue it's a reasonable
> approximation. But my guess is that it's barely faster, at the price of
> breaking correctness.

I definitely agree that the correctness should not be broken. My solution is
not meant to decrease the number of possible items, as in your example. It was
meant to reduce the number of item-user associations (while preserving the
user-item associations), which results in much less work inside
intersectionSize(). Even in the case where we have two popular items, item A
with 50k interactions and item B with 100k interactions, there are still 50k
lookups inside intersectionSize(). If 50% of the users who interacted with
that item interacted _only_ with that item, we are talking about a huge
saving, are we not? For example, how many of the people who bought The Da
Vinci Code on Amazon had bought other books there as well?

> On Fri, Dec 2, 2011 at 3:58 PM, Daniel Zohar <[email protected]> wrote:
>
> > It's already hard to say what has the most significant impact here. Just
> > to summarize, we had basically 3 changes which made an enormous
> > improvement in performance:
> > 1. Limit the number of 'possibleItemIDs' returned by
> > SamplingCandidateItemsStrategy
> > 2. Optimizing getNumUsersWithPreferenceFor() with
> > "larger.intersectionSize(smaller)"
> > 3. Excluding 'singleton users' from
> > GenericBooleanPrefDataModel.preferenceForItems
> >
> > I think we already agreed about 1 & 2 (correct me if I'm wrong).
> > Regarding 3, if we had a Map that mapped itemID to numOfUsers, that might
> > solve the problem and still give a great performance boost. You can see in
> > my previous results that this change reduces query time to at least a
> > quarter (2s vs 0.5s). What do you think?
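To spell out what I mean by that itemID -> numOfUsers map, here is a rough
sketch of the idea only -- the class and variable names (ItemUserCountSketch,
rawUserData) are purely illustrative, not existing Mahout code:

import java.util.HashMap;
import java.util.Map;

// Sketch of the #3 idea only, not the real GenericBooleanPrefDataModel.
// Build an itemID -> numUsers map from the raw user -> itemIDs input, so the
// per-item counts stay exact even if 'singleton' users were later left out of
// preferenceForItems. 'rawUserData' stands in for the model's input data.
final class ItemUserCountSketch {

  static Map<Long, Long> countUsersPerItem(Map<Long, long[]> rawUserData) {
    Map<Long, Long> itemUserCounts = new HashMap<Long, Long>();
    for (long[] itemIDs : rawUserData.values()) {
      for (long itemID : itemIDs) {
        Long old = itemUserCounts.get(itemID);
        itemUserCounts.put(itemID, old == null ? 1L : old + 1L);
      }
    }
    return itemUserCounts;
  }

  private ItemUserCountSketch() { }
}

getNumUsersWithPreferenceFor(itemID) could then answer from these counts,
while the intersection computations keep using the (smaller)
preferenceForItems sets; as far as I can see, a user who interacted with only
one item can never show up in a two-item intersection anyway.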
