Re: performance for small cluster

Andrzej Bialecki Wed, 18 Aug 2010 01:31:55 -0700

On 2010-08-18 00:03, AJ Chen wrote:

Hi Andrzej,
During updatedb, reduce tasks (as seen in log) take most of the time.

Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually started pretty soon after you start the map tasks, BUT they just sit idle and wait for map tasks to finish. Whenever a map task finishes, a process called "shuffling" occurs, i.e. records that fall into that reduce task are shuffled from mapper output to reducer input. Still, the reduce does not run until ALL map tasks are finished, to ensure that all records have been shuffled. At which point the sorting begins, i.e. all shuffled parts that ended up at a particular reducer are sorted by key. And finally the last part begins, i.e. the reduce() operation itself, which produces output directly, without any further shuffling/sorting/post-processing.
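
As a rough illustration of the three phases described above, here is a toy, single-process Python sketch. It is not Hadoop code: run_job, partition and the summing of key bytes used to pick a reducer are hypothetical stand-ins chosen only to make the order of operations visible.

```python
from collections import defaultdict
from itertools import groupby


def partition(key, num_reducers):
    # Hypothetical stand-in for a partitioner: picks the reducer that owns a key.
    return sum(key.encode("utf-8")) % num_reducers


def run_job(map_outputs, num_reducers, reduce_fn):
    # Phase 1, shuffle: each finished map task's records are copied
    # to the input of the reducer that owns the key.
    shuffled = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            shuffled[partition(key, num_reducers)].append((key, value))

    # Nothing below runs until ALL map outputs have been shuffled.
    results = {}
    for reducer_id in range(num_reducers):
        # Phase 2, sort: the shuffled parts at this reducer are sorted by key.
        records = sorted(shuffled[reducer_id], key=lambda kv: kv[0])
        # Phase 3, reduce: reduce_fn is called once per key, and its
        # result is the final output (no further shuffling or sorting).
        results[reducer_id] = [
            (key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(records, key=lambda kv: kv[0])
        ]
    return results
```

For example, run_job([[("a", 1), ("b", 2)], [("a", 3)]], 2, lambda k, vs: sum(vs)) sends both "a" records to the same reducer, and only then sorts and reduces them.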

Each phase takes a different amount of time, and for different reasons. Slow shuffling may indicate IO issues (disk, net), or an overload of the source nodes (mappers). Slow sorting indicates poor disk IO performance of reduce nodes (or again, too high a load on the reduce nodes). Slow reducing is usually caused by the slowness of reduce() itself (e.g. CPU-intensive operations, or IO contention when writing output).
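
The rule of thumb above can be written down as a small lookup. This is only a sketch: slowest_phase and LIKELY_CAUSES are hypothetical names, and the per-phase durations are assumed to have been collected by hand from the task logs or the job tracker; nothing here queries Hadoop.

```python
# Likely causes per phase, paraphrased from the explanation above.
LIKELY_CAUSES = {
    "shuffle": "IO issues (disk, net) or overloaded source nodes (mappers)",
    "sort": "poor disk IO on the reduce nodes, or too high a load on them",
    "reduce": "slow reduce() itself (CPU-intensive work or IO contention on output)",
}


def slowest_phase(durations):
    # durations: seconds spent per phase for one reduce task,
    # e.g. {"shuffle": ..., "sort": ..., "reduce": ...} (measured by you).
    phase = max(durations, key=durations.get)
    return phase, LIKELY_CAUSES[phase]
```

With made-up numbers such as {"shuffle": 12.0, "sort": 340.0, "reduce": 45.0}, this points at the sort phase, i.e. at disk IO on the reduce nodes.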


Do you use HDFS? Do you use a network attached file store or local disks?

2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - header:
attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - Shuffling 2 bytes (6 raw
bytes) into RAM from attempt_201008141418_0023_m_000023_0

This indicates that there was no data in that particular shuffle (no keys from the map task that fall into this reducer).
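
To see how many shuffles in a log are empty like this one, the "Shuffling ..." lines can be picked apart with a regular expression. A minimal sketch, assuming the line format is exactly as in the excerpt quoted above; parse_shuffle and is_empty_shuffle are hypothetical helpers, and treating 2 bytes or fewer as "empty" is an assumption drawn from this one excerpt.

```python
import re

# Pattern modelled on the quoted log line; \s+ tolerates the mail line wrap.
SHUFFLE_RE = re.compile(
    r"Shuffling (\d+) bytes \((\d+) raw\s+bytes\) into \w+ from\s+(attempt_\w+)"
)


def parse_shuffle(line):
    # Returns None for lines that are not shuffle messages.
    match = SHUFFLE_RE.search(line)
    if match is None:
        return None
    return {
        "bytes": int(match.group(1)),      # decompressed length
        "raw_bytes": int(match.group(2)),  # compressed length
        "attempt": match.group(3),         # map task attempt the data came from
    }


def is_empty_shuffle(record, empty_len=2):
    # Assumption: a decompressed length of empty_len or less carries no keys.
    return record["bytes"] <= empty_len
```

Counting the records for which is_empty_shuffle() is true, per reducer, gives a rough picture of how evenly the keys are spread across reduce tasks.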


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
