That's your biggest issue, certainly. Only 2 mappers are running, even though you have 20 machines available. Hadoop determines the number of mappers based on input size, and your input isn't so big that it thinks you need 20 workers. It's launching 33 reducers, so your cluster is put to use there. But it's no wonder you're not seeing anything like 20x speedup in the mapper.
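For reference, the override can be passed at job submission time via the generic options. A hypothetical invocation (jar name, class, and paths are placeholders; your job must use Tool/GenericOptionsParser for -D to take effect):

    hadoop jar myjob.jar com.example.MyJob \
      -Dmapred.map.tasks=20 \
      input/ output/

Note that mapred.map.tasks is only a hint: the InputFormat still decides the actual number of splits, so if the hint is ignored you may instead need to shrink the split size (e.g. mapred.max.split.size) so the input divides into more pieces.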
You can of course force it to use more mappers, and that's probably a good idea here: -Dmapred.map.tasks=20, perhaps. More mappers mean more overhead spinning up tasks that each process less data, and Hadoop's guess indicates it thinks 20 workers would not be efficient. If you know those other 18 machines are otherwise idle, my guess is you'd benefit from just making it use all 20. If this were a general shared cluster where many people were using the workers, I'd trust Hadoop's guesses until you were sure you wanted to do otherwise.

On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <[email protected]> wrote:
> Thanks for all the suggestions!
>
> All the inputs are the same. It takes 85 hours for 4 iterations on 20
> Amazon small machines. On my local single node, it got to iteration 19 in
> the same 85 hours.
>
> Here is a section of the Amazon log output.
> It covers the start of iteration 1, and between iteration 4 and
> iteration 5.
>
> The number of map tasks is set to 2. Should it be larger, or related to
> the number of CPU cores?
