It is memory. You will need a pretty large heap to hold 100M ratings in memory -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM). You can go bigger if you have more memory, but that is about the largest heap you can reasonably assume people have.
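As a rough back-of-envelope check of that 4GB figure (the ~28 bytes per preference and the 1.5x overhead factor here are assumptions -- two 8-byte IDs plus a 4-byte float plus object/index overhead -- not measured numbers from the Taste internals):

```python
def estimated_heap_gb(num_prefs, bytes_per_pref=28, overhead_factor=1.5):
    """Estimate the JVM heap (GiB) needed to hold num_prefs preferences.

    bytes_per_pref is an assumed per-rating cost (user ID long + item ID
    long + float value); overhead_factor is a fudge for indexes and GC
    headroom. Both are illustrative, not measured.
    """
    return num_prefs * bytes_per_pref * overhead_factor / 2**30

print(round(estimated_heap_gb(100_000_000), 1))  # ~3.9 GiB, i.e. "probably 4GB"
```

Under those assumptions, 100M ratings lands right around a 4GB heap, which is why an 8GB machine is about the practical floor.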
Of course, more data slows things down, and past about 10M data points you need to tune things to sample the data rather than try every possibility. This is most of what CandidateItemsStrategy is for. It is relatively easy to tune, though, so speed doesn't have to be an issue. Again, you can go bigger and tune it to down-sample more; still, I believe 100M is a crude but useful rule of thumb for the point beyond which it is just hard to get good speed and quality.

Sean

On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <[email protected]> wrote:
> Thanks for the detailed answer Sean.
> I want to understand more clearly the non-distributed code limitations.
> I saw that you advise that for more than 100,000,000 ratings the
> non-distributed engine won't do the job.
> The question is why? Is it memory issue (and then if I will have a bigger
> machine, meaning I could scale up), or is it because of the recommendation
> time it takes?
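P.S. The down-sampling idea behind a candidate strategy can be sketched in a few lines. This is purely illustrative -- it is not the actual Mahout CandidateItemsStrategy API, and the cap of 200 candidates is an arbitrary example value:

```python
import random

def sample_candidates(candidate_ids, max_candidates=200, rng=None):
    """Cap the number of candidate items considered per recommendation.

    Illustrates the idea of tuning a candidate strategy to down-sample:
    work per recommendation stays bounded no matter how large the item
    catalog grows. Not the Mahout API; max_candidates=200 is arbitrary.
    """
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    candidates = list(candidate_ids)
    if len(candidates) <= max_candidates:
        return candidates
    return rng.sample(candidates, max_candidates)

# A million candidate items still yields only 200 to score.
print(len(sample_candidates(range(1_000_000))))  # 200
```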
