Ah OK, so this is quite a big problem. Still, it is quite useful to be able to make recommendations in real time, or near-real-time. It saves the relatively large cost of precomputing, and lets you respond immediately to new data. If the site has a lot of occasional or new users, that can make a huge difference -- if I visit once, or once a month, precomputing recommendations every day starting tomorrow doesn't help much.
Of course, that can be difficult to reconcile with <100ms response times, but with some tricks like LSH and some reasonable hardware I think you'd find it possible at this scale. It does take a lot of engineering.

On Tue, Mar 5, 2013 at 9:43 PM, Josh Devins <[email protected]> wrote:

> Thanks Sean, at least I know I'm mostly on the right track ;)
>
> So in our case (a large, social, consumer website), this is already a small
> subset of all users (and items for that matter) and is really only the
> active users. In fact, in future iterations, the number of users will
> likely grow by around 3x (or at least, that's my optimistic target). So
> it's not very likely that we'll be able to calculate recommendations for
> fewer users, but I like the idea of leaving all items in the matrix while
> not computing preference predictions for all of them. I will think on this
> and see if it fits our domain (it probably will), and maybe send a pull
> request to Mahout if I can make it generic in some way! LSH was my
> instinctive approach also, but I wasn't totally sure it was sane! I'll
> have a look into this as well if needed.
>
> Thanks for the advice!
>
> Josh
>
> On 5 March 2013 22:23, Sean Owen <[email protected]> wrote:
>
> > Without any tricks, yes, you have to do this much work to really know
> > which are the largest values in UM' for every row. There's no obvious
> > twist that speeds it up.
> >
> > (Do you really want to compute all user recommendations? How many of
> > the 2.6M are likely to be active soon, or ever?)
> >
> > First, usually it's only a subset of all items that are recommendable
> > anyway. You don't want them out of the model, but you don't need to
> > consider them. This is domain-specific of course, but if 90% of the
> > items are "out of stock" or something, you can simply not bother to
> > score them in the first place.
> >
> > Yes, LSH is exactly what I do as well.
> > You hash the item feature vectors into buckets and then only iterate
> > over nearby buckets to find candidates. You can avoid looking at 90+%
> > of candidates this way without much, if any, impact on top N.
> >
> > Pruning is indeed third on the list, but usually you get the problem
> > down to a pretty good size from the points above.
> >
> > On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I have a conceptually simple problem. A user-item matrix, A, whose
> > > dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS
> > > with 20 features reduces this in the usual way to A = UM'. Trying to
> > > generate top-n (where n=100) recommendations for all users in U is
> > > quite a long process though. Essentially, for every user, it's
> > > generating a prediction for all unrated items in M and then taking
> > > the top-n (all in-memory). I'm using the standard ALS
> > > `RecommenderJob` for this.
> > >
> > > Considering that there are ~2.6M users and ~2.8M items, this is a
> > > really, really time-consuming way to find the top-n recommendations
> > > for all users in U. I feel like there could be a tricky way to avoid
> > > having to compute all item predictions for a user though. I can't
> > > find any reference in papers about improving this, but at the moment
> > > the estimate (with 10 mappers running the `RecommenderJob`) is ~80
> > > hours. When I think about this problem, I wonder if applying kNN or
> > > locality-sensitive hashing (min-hashing) would somehow help me:
> > > basically, find the nearest neighbours directly and calculate
> > > predictions on those items only, not on every item in M. On the flip
> > > side, I could start to reduce the item space, since it's quite
> > > large, by removing items that have low in-degrees, since these
> > > probably don't contribute much to the final recommendations. I don't
> > > like this so much though, as it could remove some of the long-tail
> > > recommendations. At least, that is my intuition :)
> > >
> > > Thoughts anyone?
> > >
> > > Thanks in advance,
> > >
> > > Josh
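For reference, the full-scoring step Josh describes -- per user, a dot product of the user's factor vector against every row of M, then keeping the n largest -- can be sketched in a few lines. This is a toy illustration with made-up data, not Mahout's `RecommenderJob` internals; `top_n`, `F`, and the dict layout are assumptions:

```python
import heapq
import random

random.seed(1)
F = 20  # ALS rank, as in the thread

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_n(user_vec, item_vecs, rated, n=100):
    # Score every unrated item, then keep the n largest predictions.
    # This is the O(|M|) per-user step the thread is trying to avoid:
    # ~2.8M dot products per user, repeated for ~2.6M users.
    scored = ((dot(user_vec, vec), item_id)
              for item_id, vec in item_vecs.items()
              if item_id not in rated)
    return [item_id for _, item_id in heapq.nlargest(n, scored)]

# Tiny fake factor matrices standing in for U's rows and M's rows.
items = {i: [random.gauss(0, 1) for _ in range(F)] for i in range(500)}
user = [random.gauss(0, 1) for _ in range(F)]
recs = top_n(user, items, rated={0, 1, 2}, n=100)
```

At ~2.6M users x ~2.8M items x 20 features, the full scoring is on the order of 10^14 multiply-adds before the top-n selection even starts, which is why pruning the candidate set matters so much.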
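Sean's bucketing trick -- hash the item feature vectors into buckets, then score only candidates from nearby buckets -- might look roughly like this. It is a random-hyperplane LSH sketch over toy data; the plane count and the Hamming-distance-1 probing are assumptions for illustration, not what Mahout or Sean's code actually does:

```python
import random

random.seed(7)
F = 20       # ALS rank, as in the thread
PLANES = 8   # hash bits per signature; an assumption, tuned for bucket size

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Random hyperplanes shared by all vectors; the sign of the projection
# onto each plane contributes one bit of the bucket key.
planes = [[random.gauss(0, 1) for _ in range(F)] for _ in range(PLANES)]

def signature(vec):
    key = 0
    for p in planes:
        key = (key << 1) | (1 if dot(p, vec) >= 0 else 0)
    return key

# Index item feature vectors (rows of M) into buckets once, up front.
items = {i: [random.gauss(0, 1) for _ in range(F)] for i in range(1000)}
buckets = {}
for item_id, vec in items.items():
    buckets.setdefault(signature(vec), []).append(item_id)

def probe(key):
    # The query's own bucket plus all buckets at Hamming distance 1.
    yield key
    for bit in range(PLANES):
        yield key ^ (1 << bit)

def recommend(user_vec, n=10):
    candidates = set()
    for key in probe(signature(user_vec)):
        candidates.update(buckets.get(key, []))
    # Score only the candidates instead of all ~|M| items.
    scored = sorted(((dot(user_vec, items[i]), i) for i in candidates),
                    reverse=True)
    return [i for _, i in scored[:n]]

user = [random.gauss(0, 1) for _ in range(F)]
recs = recommend(user)
```

One caveat worth noting: sign-of-random-projection buckets approximate angular closeness, not maximum inner product, so the candidate set can miss large-norm items when ranking by raw dot product; in practice you would tune the number of planes and hash tables, and probe more buckets, to trade recall against speed.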
