Hi AJ,

On Aug 18, 2010, at 7:26am, AJ Chen wrote:

Thanks for the explanation. I'm using HDFS. What config parameters may help
speed up shuffling, merging, sorting and IO? For example,
fs.inmemory.size.mb=
io.file.buffer.size=4096 (default)
io.sort.factor=10 (default)
io.sort.mb=100 (default)

Two biggest wins are usually:

1. turn on map output compression (DefaultCodec is fine, LZO would be better)

2. increase io.sort.mb as much as you can without triggering swap hell on your box (this depends on total memory and on how much is allocated for the DataNode/TaskTracker and the child JVMs for map/reduce tasks).
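
For reference, the two settings above might look roughly like this in a Hadoop 0.20-era configuration file (for example mapred-site.xml, or whichever site file your Nutch setup reads). This is only a sketch: the io.sort.mb value of 200 is an arbitrary illustration rather than a recommendation, and the sort buffer has to fit inside the child JVM heap configured through mapred.child.java.opts.

```xml
<!-- Sketch only: values are illustrative assumptions, not tuned recommendations. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<!-- Assumed value; raise it only as far as the child JVM heap allows. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```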

-- Ken

On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]> wrote:

On 2010-08-18 00:03, AJ Chen wrote:

Hi Andrzej,
During updatedb, reduce tasks (as seen in log) take most of the time.


Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually
started pretty soon after you start the map tasks, BUT they just sit idle
and wait for map tasks to finish. Whenever a map task finishes, a process
called "shuffling" occurs, i.e. records that fall into that reduce task are
shuffled from mapper output to reducer input. Still, the reduce does not
run until ALL map tasks are finished, to ensure that all records have been
shuffled. At that point the sorting begins, i.e. all shuffled parts that
ended up at a particular reducer are sorted by key. And finally the last
part begins, i.e. the reduce() operation itself, which produces output
directly, without any further shuffling/sorting/post-processing.
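
The sequence above can be sketched in a few lines of Python. This is a toy, single-process model and not Hadoop code: `run_job`, `partition` and the CRC-based partitioner are hypothetical names chosen for illustration, and the real framework shuffles over the network and merges on disk.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (assumption for this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(map_outputs, reduce_fn, num_reducers):
    """map_outputs holds one list of (key, value) pairs per finished map task."""
    # Shuffle: every pair is routed to the reducer that owns its key.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))

    results = []
    for pairs in reducer_inputs:
        # Sort: runs only once all map outputs have been shuffled.
        pairs.sort(key=itemgetter(0))
        # Reduce: one call per distinct key, output is written directly.
        reduced = [
            (key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(pairs, key=itemgetter(0))
        ]
        results.append(reduced)
    return results
```

A call such as `run_job([[("a", 1), ("b", 1)], [("a", 1)], []], lambda k, vs: sum(vs), 3)` returns one output list per reducer; with more reducers than distinct keys, some of those lists are necessarily empty.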

Each phase takes a different amount of time, and for different reasons. Slow shuffling may indicate IO issues (disk, net), or an overload of the source nodes (mappers). Slow sorting indicates poor disk IO performance of the reduce nodes (or again, too high a load on the reduce nodes). Slow reducing is usually caused by the slowness of reduce() itself (e.g. CPU-intensive operations, or IO contention when writing output).

Do you use HDFS? Do you use a network attached file store or local disks?




2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:54,653 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0


This indicates that there was no data in that particular shuffle (no keys
from that map task fall into this reducer).
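
To see how many shuffles in a reduce task log are empty like this one, something along these lines could be used. It is a rough sketch: `empty_shuffles` is a hypothetical helper, the pattern is modelled only on the log line quoted above, and the 2-byte threshold for "no records" is an assumption taken from this single sample, so it is left as a parameter.

```python
import re

# Modelled on the "Shuffling N bytes (M raw bytes) into RAM from attempt_..."
# line quoted above; \s+ tolerates lines that were wrapped by a mail client.
SHUFFLE_RE = re.compile(
    r"Shuffling\s+(\d+)\s+bytes\s+\((\d+)\s+raw\s+bytes\)\s+into\s+RAM\s+from\s+(attempt_\S+)"
)


def empty_shuffles(log_text, empty_size=2):
    """Return the map attempt IDs whose shuffled segment looks empty.

    empty_size is the assumed decompressed size (in bytes) of a segment
    that carries no records.
    """
    return [
        match.group(3)
        for match in SHUFFLE_RE.finditer(log_text)
        if int(match.group(1)) <= empty_size
    ]
```

If most map attempts show up in that list for a given reducer, the keys are probably distributed very unevenly across reducers.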


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



