Ah OK, so this is quite a big problem. Still, it is quite useful to be able
to make recommendations in real time, or near real time. It saves the
relatively large cost of precomputing, and lets you respond
immediately to new data. If the site has a lot of occasional or new users,
that can make a huge difference -- if I visit once, or once a month,
precomputing recommendations every day starting tomorrow doesn't help much.

Of course, that can be difficult to reconcile with <100ms response times,
but with some tricks like LSH and some reasonable hardware I think you'd
find it possible at this scale. It does take a lot of engineering.
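
To make the bucketing idea concrete, here's a toy sketch in NumPy of
random-hyperplane LSH over the item factor vectors (this is just an
illustration of the technique, not Mahout's implementation; the plane count
and bucketing scheme are made-up choices):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

n_items, n_features = 10_000, 20
item_factors = rng.standard_normal((n_items, n_features))  # stand-in for M from ALS

# Random-hyperplane LSH: each plane contributes one sign bit, so vectors
# pointing in similar directions tend to land in the same bucket.
n_planes = 8                                 # 2^8 = 256 buckets
planes = rng.standard_normal((n_planes, n_features))

def bucket_key(v):
    # Sign pattern of the vector against each hyperplane, packed as bytes.
    return ((planes @ v) > 0).tobytes()

# Index every item vector by its bucket once, up front.
buckets = defaultdict(list)
for i, v in enumerate(item_factors):
    buckets[bucket_key(v)].append(i)

def recommend(user_vec, top_n=100):
    # Score only the items hashed near the user vector, not all n_items.
    candidates = buckets.get(bucket_key(user_vec)) or list(range(n_items))
    scores = item_factors[candidates] @ user_vec
    order = np.argsort(scores)[::-1][:top_n]
    return [candidates[i] for i in order]

user_vec = rng.standard_normal(n_features)
recs = recommend(user_vec, top_n=10)
```

One caveat: hyperplane LSH approximates cosine similarity, so large-norm
items with a slightly different direction can be missed for dot-product
top-N. In practice you also probe nearby buckets (flip one or two bits of
the key) to recover most of those.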



On Tue, Mar 5, 2013 at 9:43 PM, Josh Devins <[email protected]> wrote:

> Thanks Sean, at least I know I'm mostly on the right track ;)
>
> So in our case (a large, social, consumer website), this is already a small
> subset of all users (and items for that matter) and is really only the
> active users. In fact, in future iterations, the number of users will
> likely grow by around 3x (or at least, that's my optimistic target). So
> it's not very likely that we'll be able to calculate recommendations for
> fewer users, but I like the idea of leaving all items in the matrix while
> not computing preference predictions for all of them. I will think on this
> and see if it fits our domain (it probably will), and maybe a pull
> request to Mahout if I can make this generic in some way! LSH was my
> instinctual approach also, but I wasn't totally sure if it was sane! I'll
> have a look into this as well if needed.
>
> Thanks for the advice!
>
> Josh
>
>
>
> On 5 March 2013 22:23, Sean Owen <[email protected]> wrote:
>
> > Without any tricks, yes you have to do this much work to really know which
> > are the largest values in UM' for every row. There's not an obvious twist
> > that speeds it up.
> >
> > (Do you really want to compute all user recommendations? how many of the
> > 2.6M are likely to be active soon, or, ever?)
> >
> > First, usually it's only a subset of all items that are recommendable
> > anyway. You don't want them out of the model, but you don't need to
> > consider them. This is domain-specific of course, but if 90% of the items
> > are "out of stock" or something, of course you can not bother to score
> > them in the first place.
> >
> > Yes, LSH is exactly what I do as well. You hash the item feature vectors
> > into buckets and then only iterate over nearby buckets to find candidates.
> > You can avoid looking at 90+% of candidates this way without much, if any,
> > impact on the top N.
> >
> > Pruning is indeed third on the list but usually you get the problem to a
> > pretty good size from the points above.
> >
> >
> >
> > On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I have a conceptually simple problem. A user-item matrix, A, whose
> > > dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS
> > > with 20 features reduces this in the usual way to A = UM'. Trying to
> > > generate top-n (where n=100) recommendations for all users in U is
> > > quite a long process though. Essentially, for every user, it's
> > > generating a prediction for all unrated items in M then taking the
> > > top-n (all in-memory). I'm using the standard ALS `RecommenderJob` for
> > > this.
> > >
> > > Considering that there are ~2.6M users and ~2.8M items, this is a
> > > really, really time-consuming way to find the top-n recommendations for
> > > all users in U. I feel like there could be a tricky way to avoid having
> > > to compute all item predictions for a user though. I can't find any
> > > reference in papers about improving this, but at the moment the
> > > estimate (with 10 mappers running the `RecommenderJob`) is ~80 hours.
> > > When I think about this problem I wonder if applying kNN or
> > > locality-sensitive hashing (min-hashing) would somehow help me.
> > > Basically, find the nearest neighbours directly and calculate
> > > predictions on those items only, not every item in M. On the flip side,
> > > I could start to reduce the item space, since it's quite large --
> > > basically start removing items that have low in-degrees, since these
> > > probably don't contribute much to the final recommendations. I don't
> > > like this so much though, as it could remove some of the long-tail
> > > recommendations. At least, that is my intuition :)
> > >
> > > Thoughts anyone?
> > >
> > > Thanks in advance,
> > >
> > > Josh
> > >
> >
>
