Thanks for the explanation. I'm using hdfs. what config parameters may help speed up shuffling, merging, sorting and IO? For example, *fs.inmemory.size.mb= io.file.buffer.size=4096(default) io.sort.factor=10(default) io.sort.mb=100(default) * -aj
On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]> wrote: > On 2010-08-18 00:03, AJ Chen wrote: > >> Hi Andrzej, >> During updatedb, reduce tasks (as seen in log) take most of the time. >> > > Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually > started pretty soon after you start the map tasks, BUT they just sit idle > and wait for map tasks to finish. Whenever a map task finishes, a process > called "shuffling" occurs, i.e. records that fall into that reduce task are > shuffled from mapper output to reducer input. Still, the reduce does not run > yet until ALL map tasks are finished, to ensure that all records have been > shuffled. At which point the sorting begins, i.e. all shuffled parts that > ended up at a particular reducer are sorted by key. And finally the last > part begins, i.e. the reduce() operation itself, which produces output > directly, without any further shuffling/sorting/post-processing. > > Each phase takes different time and for different reasons. Slow shuffling > may indicate IO issues (disk, net), or an overload of the source nodes > (mappers). Slow sorting indicates poor disk IO performance of reduce nodes > (or again, too high load of reduce nodes). Slow reducing is usually caused > by the slowness of reduce() itself (e.g. cpu intensive operations, or IO > contention when writing output. > > Do you use HDFS? Do you use a network attached file store or local disks? > > > > > 2010-08-17 09:31:54,653 INFO mapred.ReduceTask - header: >> attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: >> 2 >> 2010-08-17 09:31:54,653 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw >> bytes) into RAM from attempt_201008141418_0023_m_000023_0 >> > > This indicates that there was no data in that particular shuffle (no keys > from map task that fall into this reducer). > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > -- AJ Chen, PhD Chair, Semantic Web SIG, sdforum.org http://web2express.org twitter @web2express Palo Alto, CA, USA

