Hi Thiago,

Speaking subjectively, there are a number of items to consider to achieve nearly linear scaling:
- The work must be well balanced among the tasks (no skew).
- There must be no skew in the assignment of tasks to nodes. Note: this skew actually happens by default if the number of tasks is less than the cluster's slot capacity. On a cluster with 20 nodes, each configured for 20 mapper tasks, a job launched with 20 maps may well have all of them running on one node.
- With a higher number of tasks, the risk of stragglers affecting overall throughput/performance increases unless speculative execution is set properly.
- Hadoop configuration settings come under more pressure with more
- Properly tuning the number of mappers and reducers to (a) your node and cluster characteristics and (b) the particular tasks has a large impact on performance. In my experience the settings are often too conservative / too low to take advantage of the node and cluster resources.

So in summary, Hadoop itself is capable of nearly linear scaling to low thousands of nodes, but configuring the cluster to really achieve that requires effort.

2013/1/17 Thiago Vieira <[email protected]>

> Hello!
>
> Is common to see this sentence: "Hadoop Scales Linearly". But, is there
> any performance evaluation to confirm this?
>
> In my evaluations, Hadoop processing capacity scales linearly, but not
> proportional to number of nodes, the processing capacity achieved with 20
> nodes is not the double of the processing capacity achieved with 10 nodes.
> Is there any evaluation about this?
>
> Thank you!
>
> --
> Thiago Vieira
>
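
The comparison in the quoted question (20 nodes versus 10 nodes) can be expressed as speedup and parallel efficiency. Below is a minimal sketch in Python. The function name and the throughput figures are hypothetical and do not come from this thread; substitute your own measurements.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Return (speedup, efficiency) of a larger cluster relative to a baseline.

    speedup    = measured throughput ratio
    efficiency = speedup divided by the ideal (node-count) ratio;
                 1.0 means perfectly linear scaling.
    """
    if base_nodes <= 0 or nodes <= 0 or base_throughput <= 0:
        raise ValueError("node counts and baseline throughput must be positive")
    speedup = throughput / base_throughput
    ideal = nodes / base_nodes
    return speedup, speedup / ideal


# Hypothetical measurements in MB/s (assumed for illustration only).
speedup, efficiency = scaling_efficiency(10, 400.0, 20, 680.0)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency well below 100% when doubling the node count is a hint to check the items listed in the reply (task skew, slot assignment, stragglers, mapper/reducer settings) before concluding that the framework itself does not scale.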
