On 2010-08-18 00:03, AJ Chen wrote:
Hi Andrzej, During updatedb, reduce tasks (as seen in log) take most of the time.
Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually started pretty soon after you start the map tasks, BUT they just sit idle and wait for map tasks to finish. Whenever a map task finishes, a process called "shuffling" occurs, i.e. records that fall into that reduce task are shuffled from mapper output to reducer input. Still, the reduce does not run yet until ALL map tasks are finished, to ensure that all records have been shuffled. At which point the sorting begins, i.e. all shuffled parts that ended up at a particular reducer are sorted by key. And finally the last part begins, i.e. the reduce() operation itself, which produces output directly, without any further shuffling/sorting/post-processing.
Each phase takes different time and for different reasons. Slow shuffling may indicate IO issues (disk, net), or an overload of the source nodes (mappers). Slow sorting indicates poor disk IO performance of reduce nodes (or again, too high load of reduce nodes). Slow reducing is usually caused by the slowness of reduce() itself (e.g. cpu intensive operations, or IO contention when writing output.
Do you use HDFS? Do you use a network attached file store or local disks?
2010-08-17 09:31:54,653 INFO mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2 2010-08-17 09:31:54,653 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0
This indicates that there was no data in that particular shuffle (no keys from map task that fall into this reducer).
-- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

