Thanks for the explanation. I'm using hdfs.  what config parameters may help
speed up shuffling, merging, sorting and IO? For example,
*fs.inmemory.size.mb=
io.file.buffer.size=4096(default)
io.sort.factor=10(default)
io.sort.mb=100(default)
*
-aj

On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]> wrote:

> On 2010-08-18 00:03, AJ Chen wrote:
>
>> Hi Andrzej,
>> During updatedb, reduce tasks (as seen in log) take most of the time.
>>
>
> Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually
> started pretty soon after you start the map tasks, BUT they just sit idle
> and wait for map tasks to finish. Whenever a map task finishes, a process
> called "shuffling" occurs, i.e. records that fall into that reduce task are
> shuffled from mapper output to reducer input. Still, the reduce does not run
> yet until ALL map tasks are finished, to ensure that all records have been
> shuffled. At which point the sorting begins, i.e. all shuffled parts that
> ended up at a particular reducer are sorted by key. And finally the last
> part begins, i.e. the reduce() operation itself, which produces output
> directly, without any further shuffling/sorting/post-processing.
>
> Each phase takes different time and for different reasons. Slow shuffling
> may indicate IO issues (disk, net), or an overload of the source nodes
> (mappers). Slow sorting indicates poor disk IO performance of reduce nodes
> (or again, too high load of reduce nodes). Slow reducing is usually caused
> by the slowness of reduce() itself (e.g. cpu intensive operations, or IO
> contention when writing output.
>
> Do you use HDFS? Do you use a network attached file store or local disks?
>
>
>
>
>  2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - header:
>> attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len:
>> 2
>> 2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - Shuffling 2 bytes (6 raw
>> bytes) into RAM from attempt_201008141418_0023_m_000023_0
>>
>
> This indicates that there was no data in that particular shuffle (no keys
> from map task that fall into this reducer).
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Reply via email to