Hi all, I have a conceptually simple problem: a user-item matrix A whose dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with 20 features factorizes this in the usual way into A ≈ UM'. Generating top-n (with n=100) recommendations for all users in U is quite a long process, though. Essentially, for every user, it generates a prediction for every unrated item in M and then takes the top n (all in memory). I'm using the standard ALS `RecommenderJob` for this.
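For reference, the per-user brute-force step amounts to something like the following (a NumPy sketch of the computation, not the actual Mahout code; `argpartition` avoids a full sort over the ~2.8M item scores):

```python
import numpy as np

def top_n_for_user(U, M, user, n=100, rated=()):
    # Score every item for this user: one row of U times M'.
    scores = U[user] @ M.T
    scores[list(rated)] = -np.inf      # exclude already-rated items
    # argpartition finds the top n in O(items) instead of a full sort.
    top = np.argpartition(scores, -n)[-n:]
    return top[np.argsort(scores[top])[::-1]]

# Tiny example: 4 users, 6 items, 2 latent features.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2))
M = rng.normal(size=(6, 2))
recs = top_n_for_user(U, M, user=0, n=3)
```

Doing this for ~2.6M users is ~2.6M x ~2.8M dot products, which is where the ~80 hours goes.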
Considering that there are ~2.6M users and ~2.8M items, this is a really, really time-consuming way to find the top-n recommendations for all users in U. I feel like there should be a way to avoid computing predictions for every item per user, but I can't find any reference in papers about improving this. At the moment, the estimate (with 10 mappers running the `RecommenderJob`) is ~80 hours.

When I think about this problem, I wonder whether applying kNN or locality-sensitive hashing (min-hashing) would somehow help me: basically, find the nearest neighbours directly and calculate predictions on those items only, not every item in M.

On the flip side, I could start to reduce the item space, since it's quite large — basically start removing items that have low in-degrees, since these probably don't contribute much to the final recommendations. I don't like this so much, though, as it could remove some of the long-tail recommendations. At least, that is my intuition :)

Thoughts anyone?

Thanks in advance,
Josh
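The LSH idea could look something like this toy sketch (random-hyperplane hashing on the factor vectors; the function names are mine, and a real version would use several hash tables and/or multi-probe to control recall — this single-table version will miss candidates):

```python
import numpy as np

def lsh_candidates(U, M, user, n_planes=8, seed=0):
    # Random-hyperplane LSH: items whose factor vectors land on the same
    # side of every random hyperplane as the user's vector become the
    # candidate set, so only a fraction of M gets scored.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, M.shape[1]))
    item_sig = (M @ planes.T) > 0          # bit signature per item
    user_sig = (U[user] @ planes.T) > 0    # bit signature for the user
    return np.flatnonzero((item_sig == user_sig).all(axis=1))

def approx_top_n(U, M, user, n=100, n_planes=8, seed=0):
    cand = lsh_candidates(U, M, user, n_planes, seed)
    if cand.size == 0:
        return cand
    # Score only the candidates, not all of M.
    scores = U[user] @ M[cand].T
    return cand[np.argsort(scores)[::-1][:n]]

# Tiny example: 5 users, 20 items, 3 latent features.
rng = np.random.default_rng(1)
U = rng.normal(size=(5, 3))
M = rng.normal(size=(20, 3))
cand = lsh_candidates(U, M, user=0, n_planes=2)
recs = approx_top_n(U, M, user=0, n=4, n_planes=2)
```

With 8 planes each bucket holds roughly 1/256 of the items on average, so the per-user scoring cost drops accordingly, at the price of approximate results.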
