Thanks Sean, at least I know I'm mostly on the right track ;)

So in our case (a large, social, consumer website), this is already a small
subset of all users (and items, for that matter) and really only covers the
active users. In fact, in future iterations, the number of users will
likely grow by around 3x (or at least, that's my optimistic target). So
it's unlikely we'll be able to calculate recommendations for fewer users,
but I like the idea of leaving all items in the matrix while not computing
preference predictions for all of them. I'll think on this and see if it
fits our domain (it probably will), and maybe send a pull request to
Mahout if I can make it generic in some way! LSH was my instinctive
approach too, but I wasn't totally sure it was sane! I'll have a look into
this as well if needed.
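
To make that concrete, here's a rough sketch of what I have in mind:
random-hyperplane LSH over the ALS item factors. The names and data below
are purely illustrative (random stand-ins, nothing to do with Mahout's
actual API):

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, n_features, n_planes = 1000, 20, 8

# Item factor matrix M from ALS (random stand-in data here).
M = rng.standard_normal((n_items, n_features))

# Random hyperplanes; an item's signature is the sign pattern of its
# projections onto them, giving a bucket key in [0, 2^n_planes).
planes = rng.standard_normal((n_planes, n_features))

def signature(v):
    bits = (planes @ v) > 0
    return int(bits.dot(1 << np.arange(n_planes)))

# Hash every item feature vector into its bucket.
buckets = {}
for i in range(n_items):
    buckets.setdefault(signature(M[i]), []).append(i)

# At recommendation time, score only the items in the user's bucket
# instead of all n_items, then take the top n of those candidates.
user = rng.standard_normal(n_features)
candidates = buckets.get(signature(user), [])
scores = M[candidates] @ user
top = [candidates[j] for j in np.argsort(scores)[::-1][:10]]
```

In practice you'd use several hash tables (or probe neighbouring buckets)
to keep recall up, since a single table can miss near neighbours that land
one bit away.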

Thanks for the advice!

Josh



On 5 March 2013 22:23, Sean Owen <[email protected]> wrote:

> Without any tricks, yes you have to do this much work to really know which
> are the largest values in UM' for every row. There's not an obvious twist
> that speeds it up.
>
> (Do you really want to compute all user recommendations? how many of the
> 2.6M are likely to be active soon, or, ever?)
>
> First, usually it's only a subset of all items that are recommendable
> anyway. You don't want them out of the model, but you don't need to
> consider them either. This is domain-specific of course, but if 90% of the
> items are "out of stock" or something, you can simply not bother to score
> them in the first place.
>
> Yes, LSH is exactly what I do as well. You hash the item feature vectors
> into buckets and then only iterate over nearby buckets to find candidates.
> You can avoid looking at 90+% of candidates this way without much, if
> any, impact on the top N.
>
> Pruning is indeed third on the list but usually you get the problem to a
> pretty good size from the points above.
>
>
>
> On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins <[email protected]> wrote:
>
> > Hi all,
> >
> > I have a conceptually simple problem: a user-item matrix, A, whose
> > dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
> > 20 features reduces this in the usual way to A = UM'. Trying to generate
> > top-n (where n=100) recommendations for all users in U is quite a long
> > process though. Essentially, for every user, it's generating a prediction
> > for all unrated items in M and then taking the top-n (all in-memory). I'm
> > using the standard ALS `RecommenderJob` for this.
> >
> > Considering that there are ~2.6M users and ~2.8M items, this is a really,
> > really time-consuming way to find the top-n recommendations for all users
> > in U. I feel like there could be a tricky way to avoid having to compute
> > all item predictions for a user though. I can't find any reference in
> > papers about improving this, but at the moment the estimate (with 10
> > mappers running the `RecommenderJob`) is ~80 hours. When I think about
> > this problem, I wonder if applying kNN or locality-sensitive hashing
> > (min-hashing) would somehow help me. Basically, find the nearest
> > neighbours directly and calculate predictions on those items only, not
> > on every item in M. On the flip side, I could start to reduce the item
> > space, since it's quite large: basically, start removing items that have
> > low in-degrees, since these probably don't contribute much to the final
> > recommendations. I don't like this so much though, as it could remove
> > some of the long-tail recommendations. At least, that is my intuition :)
> >
> > Thoughts anyone?
> >
> > Thanks in advance,
> >
> > Josh
> >
>
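
For reference, the full scoring being discussed above, score every item for
every user and keep the top n, boils down to something like this NumPy
sketch. The sizes are toy stand-ins (the real ones are ~2.6M users, ~2.8M
items, 20 features, n=100), and Mahout's `RecommenderJob` does the
distributed equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, n = 50, 200, 20, 10  # toy sizes

U = rng.standard_normal((n_users, k))  # user factor matrix from ALS
M = rng.standard_normal((n_items, k))  # item factor matrix from ALS

# For every user, score every item (the user's row of UM') and keep the
# top n. This is the O(n_users * n_items * k) work the thread is trying
# to avoid.
top_n = {}
for u in range(n_users):
    scores = M @ U[u]
    # argpartition finds the n largest without fully sorting all scores
    idx = np.argpartition(-scores, n)[:n]
    top_n[u] = idx[np.argsort(-scores[idx])].tolist()
```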
