I see, thanks!

It seems this should be built into the Mahout LDA algorithms, since the input file is usually not very large but really does need parallel map processes.

Chris

On 09/06/2011 04:28 PM, Jake Mannix wrote:
You can't just set the block size; you need to modify the InputFormat to change
the number of splits.  For example, you can do:

     FileInputFormat.setMaxInputSplitSize(job, maxSizeInBytes);

and you'll force it to make more splits in your data set, and hence more mappers.

   -jake
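For reference, here is a minimal sketch of the kind of driver change Jake describes, using the new-API (org.apache.hadoop.mapreduce) FileInputFormat. The class name, job name, and the 4 MB cap below are placeholders, not Mahout's actual LDA driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class ForceMoreSplits {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lda-more-mappers");          // placeholder job name
        FileInputFormat.addInputPath(job, new Path(args[0])); // placeholder input path

        // Cap each split at 4 MB so even a small input is cut into many splits,
        // and hence many mappers, instead of one split per HDFS block.
        FileInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);

        // ... set the InputFormat, Mapper, Reducer and output path as usual,
        // then submit with job.waitForCompletion(true).
      }
    }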

On Tue, Sep 6, 2011 at 4:12 PM, Dhruv Kumar <[email protected]> wrote:

On Tue, Sep 6, 2011 at 6:57 PM, Chris Lu <[email protected]> wrote:

Thanks. Very helpful to me!

I tried to change the setting of "mapred.map.tasks".  However, the number of
map tasks is still just one, on one of the 20 machines.

./elastic-mapreduce --create --alive \
   --num-instances 20 --name "LDA" \
   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
   --bootstrap-name "Configuring number of map tasks per job" \
   --args "-m,mapred.map.tasks=40"

Does anyone know how to configure the number of mappers?
Again, the input size is only 46M.

Chris
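One hedged suggestion here: since mapred.map.tasks is only a hint (as discussed further down the thread), it may work better to cap the split size instead, e.g. by setting mapred.max.split.size (the property behind FileInputFormat.setMaxInputSplitSize on 0.20-era Hadoop; treat the exact property name as version-dependent) through the same configure-hadoop bootstrap action. A back-of-the-envelope sketch for a 46M input and 20 machines:

    // Rough arithmetic only; assumes mapper count ~= input size / max split size.
    public class SplitMath {
      public static void main(String[] args) {
        long inputBytes = 46L * 1024 * 1024; // ~46 MB of input
        int desiredMappers = 20;             // one mapper per machine
        long maxSplit = inputBytes / desiredMappers;
        // ~2.3 MB per split; a max split size around this value should yield
        // roughly 20 splits, and hence roughly 20 map tasks.
        System.out.println("max split size ~= " + maxSplit + " bytes");
      }
    }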


On 09/06/2011 12:09 PM, Ted Dunning wrote:

Well, I think that using small instances is a disaster in general.  The
performance that you get from them can vary easily by an order of magnitude.
My own preference for real work is either m2xl or cc14xl.  The latter machines
give you nearly bare metal performance and no noisy neighbors.  The m2xl is
typically very much underpriced on the spot market.

Sean is right about your job being misconfigured.  The Hadoop overhead is
considerable and you have only given it two threads to overcome that overhead.

On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <[email protected]> wrote:

That's your biggest issue, certainly. Only 2 mappers are running, even though
you have 20 machines available. Hadoop determines the number of mappers based
on input size, and your input isn't so big that it thinks you need 20 workers.
It's launching 33 reducers, so your cluster is put to use there. But it's no
wonder you're not seeing anything like 20x speedup in the mapper.

You can of course force it to use more mappers, and that's probably a good
idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more overhead of
spinning up mappers to process less data, and Hadoop's guess indicates that it
thinks it's not efficient to use 20 workers. If you know that those other 18
are otherwise idle, my guess is you'd benefit from just making it use 20.
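A side note on how a -D option like this reaches the job at all: it only takes effect if the driver goes through ToolRunner/GenericOptionsParser, which copies generic options into the job Configuration. A minimal sketch with a generic Tool (not Mahout's driver):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class HintDemo extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // "-Dmapred.map.tasks=20" from the command line lands in getConf() via
        // GenericOptionsParser; with the old mapred API it is still only a hint
        // that the InputFormat may override when it computes splits.
        System.out.println("mapred.map.tasks = " + getConf().get("mapred.map.tasks"));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new HintDemo(), args));
      }
    }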

Sean,

I too have always been confused about how Hadoop decides to set the number of
mappers, so you could help my understanding here...

Is -Dmapred.map.tasks just a hint to the framework for the number of mappers
(just like using the combiner is a hint), or does it actually set the number of
workers to that number (provided our input is large enough)?

The reason I ask is that on
http://wiki.apache.org/hadoop/HowManyMapsAndReduces it is mentioned that the
framework uses the HDFS block size to decide on the number of mapper workers
to be invoked. Should we be setting that parameter instead?
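For what it's worth, the split-size arithmetic in 0.20-era FileInputFormat is roughly the following (a simplified sketch, not a drop-in utility; exact property names vary by version). In the old mapred API the mapred.map.tasks value only feeds the "goal" size, so it is a hint; in the new mapreduce API it is ignored entirely and only the min/max split sizes and the block size matter:

    public class SplitSizeRules {

      // Old API (org.apache.hadoop.mapred.FileInputFormat): numSplitsHint comes
      // from mapred.map.tasks and is bounded by the min split size and block size.
      static long oldApiSplitSize(long totalInputSize, int numSplitsHint,
                                  long minSplitSize, long blockSize) {
        long goalSize = totalInputSize / Math.max(1, numSplitsHint);
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
      }

      // New API (org.apache.hadoop.mapreduce.lib.input.FileInputFormat): only the
      // configured min/max split sizes and the block size decide the split size.
      static long newApiSplitSize(long minSplitSize, long maxSplitSize, long blockSize) {
        return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
      }
    }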


If this were a general large cluster where many people are taking advantage of
the workers, then I'd trust Hadoop's guesses until you are sure you want to do
otherwise.

On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <[email protected]> wrote:

Thanks for all the suggestions!
All the inputs are the same. It takes 85 hours for 4 iterations on 20 Amazon
small machines. On my local single node, it got to iteration 19 in the same
85 hours.

Here is a section of the Amazon log output.
It covers the start of iteration 1, and the part between iteration 4 and
iteration 5.

The number of map tasks is set to 2. Should it be larger, or tied to the
number of CPU cores?



