Re: Hadoop Scalability

Stephen Boesch Thu, 17 Jan 2013 21:48:15 -0800

Hi Thiago,
  Subjectively: several conditions need to hold to achieve nearly
linear scaling:



   - The work is well balanced among the tasks (no skew).
   - There is no skew in the assignment of tasks to nodes. Note: this skew
   actually happens by default if the number of tasks is less than the
   cluster's slot capacity. On a cluster with 20 nodes, each configured for
   20 mapper slots, a job with 20 maps may well have all of them running on
   one node.
   - With a higher number of tasks, the risk of stragglers hurting overall
   throughput/performance increases unless speculative execution is
   configured properly.
   - hadoop configuration settings come under more pressure with more
   - Properly tuning the number of mappers and reducers to (a) your node
   and cluster characteristics and (b) the particular tasks has a large impact
   on performance. In my experience the settings are often too conservative
   (too low) to take advantage of the node and cluster resources.
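
To make the slot arithmetic in the second point concrete, here is a rough
sketch. The function name and its parameters are my own illustration, not a
Hadoop API, and it only counts slots; it does not model how the scheduler
actually places tasks.

```python
import math

def slot_usage(nodes, slots_per_node, tasks):
    """Return (total slots, fraction of slots occupied, fewest nodes that can hold all tasks).

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    total_slots = nodes * slots_per_node
    running = min(tasks, total_slots)          # tasks beyond capacity would wait
    used_fraction = running / total_slots
    fewest_nodes = math.ceil(running / slots_per_node)
    return total_slots, used_fraction, fewest_nodes

# The example from the list above: 20 nodes, 20 mapper slots each, 20 maps.
total, used, fewest = slot_usage(nodes=20, slots_per_node=20, tasks=20)
print(total, used, fewest)
```

With those numbers the job can occupy at most 5% of the 400 slots, and all 20
maps fit on a single node, which is why such a job gains nothing from the
other 19 nodes if the tasks end up packed together.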

So in summary, Hadoop itself is capable of nearly linear scaling to the low
thousands of nodes, but configuring the cluster to actually achieve that
requires effort.
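
If you want to put a number on "nearly linear", one simple way is to compare
the measured speedup with the increase in node count. This is only a sketch:
the function is my own, and the throughput figures are made-up placeholders,
not measurements from any cluster.

```python
def scaling_efficiency(nodes_a, throughput_a, nodes_b, throughput_b):
    """Measured speedup divided by the ideal (linear) speedup; 1.0 means perfectly linear."""
    speedup = throughput_b / throughput_a
    ideal = nodes_b / nodes_a
    return speedup / ideal

# Placeholder throughputs (arbitrary units), purely hypothetical.
eff = scaling_efficiency(nodes_a=10, throughput_a=100.0,
                         nodes_b=20, throughput_b=170.0)
print(f"{eff:.2f}")
```

A value well below 1.0 when going from 10 to 20 nodes would suggest looking at
the points above (skew, slot usage, stragglers, mapper/reducer settings)
before concluding that the framework itself is the limit.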


2013/1/17 Thiago Vieira <[email protected]>

> Hello!
>
> It is common to see this sentence: "Hadoop Scales Linearly". But is there
> any performance evaluation to confirm this?
>
> In my evaluations, Hadoop processing capacity scales linearly, but not
> in proportion to the number of nodes: the processing capacity achieved
> with 20 nodes is not double the processing capacity achieved with 10 nodes.
> Is there any evaluation of this?
>
> Thank you!
>
> --
> Thiago Vieira
>
