Hi,
If you do a cartesian join to predict users' preferences over all the
products, I think that 8 nodes with 64GB of RAM would not be enough for the
data.
Recently, I used ALS for a similar situation, but with just 10M users and 0.1M
products, and the minimum requirement was 9 nodes with 10GB of RAM.
Moreover,
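As a rough back-of-envelope (assuming on the order of 20M users against the
30M products, i.e. the ~600 trillion pairs mentioned elsewhere in this
thread, and a purely illustrative per-pair size), the full cartesian output
dwarfs the cluster:

  object CartesianSizeEstimate {
    def main(args: Array[String]): Unit = {
      val users     = 20e6              // assumed user count
      val products  = 30e6              // product count from this thread
      val pairs     = users * products  // ~6e14 user-product pairs
      val bytesEach = 20.0              // rough guess: two int ids + one double score
      val totalTB   = pairs * bytesEach / 1e12
      val clusterGB = 8 * 64.0          // 8 nodes x 64 GB
      println(f"pairs = $pairs%.2e, output ~ $totalTB%.0f TB, cluster RAM = $clusterGB%.0f GB")
    }
  }

Even if the per-pair footprint were much smaller, the output is still on the
order of petabytes, so materialising or shuffling the full cartesian product
is out of reach for that cluster.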
Thanks much for your reply.
By "on the fly", do you mean caching the trained model and querying it
for each user joined with the 30M products when needed?
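Roughly, what we have in mind is something like the sketch below (the rank,
regularisation, user id and candidate product set are just placeholders; the
point is only keeping the trained model around and scoring on demand):

  import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
  import org.apache.spark.rdd.RDD

  def servePredictions(ratings: RDD[Rating]): Unit = {
    // Train once and keep the factor RDDs cached so per-user queries are cheaper.
    val model: MatrixFactorizationModel =
      ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)
    model.userFeatures.cache()
    model.productFeatures.cache()

    // When user 42 shows up, score only a candidate subset, not all 30M products.
    val candidateProducts: Array[Int] = Array(101, 102, 103) // from some cheap candidate source
    val topScores = candidateProducts
      .map(p => (p, model.predict(42, p)))
      .sortBy { case (_, score) => -score }
      .take(10)
    topScores.foreach { case (p, s) => println(s"product $p -> $s") }
  }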
Our question is more about the general approach: what if we have 7M DAU?
How do companies deal with that using Spark?
Not just the join, but this means you're trying to compute 600
trillion dot products. It will never finish fast. Basically: don't do
this :) You don't, in general, compute all recommendations for all
users; you recompute for a small subset of users that were active or are
likely to be active soon. (Or
I don't think you need enough memory to hold the whole joined data set in
memory. However, memory is unlikely to be the limiting factor; it's the
massive shuffle.
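A minimal sketch of the "small subset of users" idea, assuming you already
have the trained model and some list of active user ids (both are
placeholders here); recommendProducts scores one user's row at a time rather
than the whole matrix:

  import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

  // Recompute top-k only for users who are, or are likely to be, active.
  def topKForActiveUsers(model: MatrixFactorizationModel,
                         activeUserIds: Seq[Int],
                         k: Int = 10): Map[Int, Array[Rating]] = {
    activeUserIds.map { userId =>
      userId -> model.recommendProducts(userId, k)
    }.toMap
  }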
OK, you really do have a large recommendation problem if you're
recommending for at least 7M users per day!
My hunch is that it won't be
There is also a batch prediction API in PR
https://github.com/apache/spark/pull/3098
The idea here is what Sean said: don't try to reconstruct the whole matrix,
which will be dense, but pick a set of users and calculate top-k
recommendations for them using dense level-3 BLAS. We are going to merge
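To give a feel for the blocked idea (this is not the PR's actual code; the
broadcast of product factors and the plain dot products below are
simplifications of what the PR does with blocked level-3 BLAS):

  import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
  import org.apache.spark.rdd.RDD

  def blockedTopK(model: MatrixFactorizationModel, k: Int): RDD[(Int, Array[(Int, Double)])] = {
    val sc = model.productFeatures.sparkContext
    // Broadcasting all product factors is itself heavy at 30M products; the PR
    // blocks both sides instead. Collecting here just keeps the sketch short.
    val products = sc.broadcast(model.productFeatures.collect())

    model.userFeatures.mapPartitions { userIter =>
      val prods = products.value
      userIter.map { case (userId, userVec) =>
        // The PR does this per block with a dense GEMM call; written out as
        // plain dot products here for readability.
        val scored = prods.map { case (productId, prodVec) =>
          (productId, userVec.zip(prodVec).map { case (a, b) => a * b }.sum)
        }
        (userId, scored.sortBy { case (_, score) => -score }.take(k))
      }
    }
  }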
Thanks, gen, for the helpful post.
Thank you Sean, we're currently exploring this world of recommendations
with Spark, and your posts are very helpful to us.
We've noticed that you're a co-author of Advanced Analytics with Spark;
not to get too deep off-topic, but will it be finished soon?