It is memory. You will need a pretty large heap to hold 100M ratings in memory -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM). You can go bigger if you have more memory, but that is about the largest heap you can reasonably assume people have.
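As a rough back-of-envelope check of that 4GB figure (the ~28 bytes per preference and the 1.5x overhead factor here are assumptions -- two 8-byte IDs plus a 4-byte float plus object/index overhead -- not measured numbers from the Taste internals):

```python
def estimated_heap_gb(num_prefs, bytes_per_pref=28, overhead_factor=1.5):
    """Estimate the JVM heap (GiB) needed to hold num_prefs preferences.

    bytes_per_pref is an assumed per-rating cost (user ID long + item ID
    long + float value); overhead_factor is a fudge for indexes and GC
    headroom. Both are illustrative, not measured.
    """
    return num_prefs * bytes_per_pref * overhead_factor / 2**30

print(round(estimated_heap_gb(100_000_000), 1))  # ~3.9 GiB, i.e. "probably 4GB"
```

Under those assumptions, 100M ratings lands right around a 4GB heap, which is why an 8GB machine is about the practical floor.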
Of course, more data slows things down, and past about 10M data points you need to tune things to sample the data rather than try every possibility. This is most of what CandidateItemsStrategy is for. It is relatively easy to tune, though, so speed doesn't have to be an issue. Again, you can go bigger and tune it to down-sample more; still, I believe 100M is a crude but useful rule of thumb for the point beyond which it is just hard to get good speed and quality.

Sean

On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <[email protected]> wrote:
> Thanks for the detailed answer Sean.
> I want to understand more clearly the non-distributed code limitations.
> I saw that you advise that for more than 100,000,000 ratings the
> non-distributed engine won't do the job.
> The question is why? Is it memory issue (and then if I will have a bigger
> machine, meaning I could scale up), or is it because of the recommendation
> time it takes?
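P.S. The down-sampling idea behind a candidate strategy can be sketched in a few lines. This is purely illustrative -- it is not the actual Mahout CandidateItemsStrategy API, and the cap of 200 candidates is an arbitrary example value:

```python
import random

def sample_candidates(candidate_ids, max_candidates=200, rng=None):
    """Cap the number of candidate items considered per recommendation.

    Illustrates the idea of tuning a candidate strategy to down-sample:
    work per recommendation stays bounded no matter how large the item
    catalog grows. Not the Mahout API; max_candidates=200 is arbitrary.
    """
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    candidates = list(candidate_ids)
    if len(candidates) <= max_candidates:
        return candidates
    return rng.sample(candidates, max_candidates)

# A million candidate items still yields only 200 to score.
print(len(sample_candidates(range(1_000_000))))  # 200
```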
