Hi AJ,
On Aug 18, 2010, at 7:26am, AJ Chen wrote:
Thanks for the explanation. I'm using HDFS. What config parameters may
help speed up shuffling, merging, sorting and IO? For example,
fs.inmemory.size.mb=
io.file.buffer.size=4096 (default)
io.sort.factor=10 (default)
io.sort.mb=100 (default)
The two biggest wins are usually:
1. Turn on map output compression (DefaultCodec is fine, LZO would be
better).
2. Increase io.sort.mb as much as you can without triggering swap hell
on your box (this depends on total memory and on how much is allocated
to the DataNode, the TaskTracker and the child JVMs for map/reduce
tasks).
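The two suggestions above can be sketched as a small Python helper that checks a rough memory budget and renders the matching properties as a Hadoop site-file fragment. This is only an illustration under stated assumptions: the slot counts and heap sizes are made-up numbers, the helper names are hypothetical, and the budget ignores JVM overhead and the OS page cache. The property names are the Hadoop 0.20-era ones (mapred.compress.map.output, mapred.map.output.compression.codec, io.sort.mb, mapred.child.java.opts).

```python
import xml.etree.ElementTree as ET


def node_memory_mb(map_slots, reduce_slots, child_heap_mb, daemon_heap_mb):
    """Rough worst-case heap demand on one worker node, in MB.

    Counts one child JVM per task slot plus the DataNode and TaskTracker
    daemons. JVM overhead and the OS page cache are ignored (assumption).
    """
    return (map_slots + reduce_slots) * child_heap_mb + 2 * daemon_heap_mb


def shuffle_properties(io_sort_mb, child_heap_mb, codec):
    """Build the properties for map output compression and a larger sort buffer."""
    # The sort buffer lives inside the child JVM heap, so it has to be smaller.
    if io_sort_mb >= child_heap_mb:
        raise ValueError("io.sort.mb must fit inside the child JVM heap")
    return {
        "mapred.compress.map.output": "true",
        "mapred.map.output.compression.codec": codec,
        "io.sort.mb": str(io_sort_mb),
        "mapred.child.java.opts": "-Xmx%dm" % child_heap_mb,
    }


def to_site_xml(props):
    """Render a property dict as a <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative numbers only: 4 map slots, 2 reduce slots, 512 MB child
    # heap, 1000 MB per daemon, on a box assumed to have 8 GB of RAM.
    needed = node_memory_mb(4, 2, child_heap_mb=512, daemon_heap_mb=1000)
    print("heap demand (MB):", needed, "fits in 8192 MB:", needed < 8192)
    props = shuffle_properties(
        io_sort_mb=200,
        child_heap_mb=512,
        codec="org.apache.hadoop.io.compress.DefaultCodec",
    )
    print(to_site_xml(props))
```

If the computed demand gets close to physical RAM, lowering the slot count or the child heap is probably safer than pushing io.sort.mb higher.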
-- Ken
On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]>
wrote:
On 2010-08-18 00:03, AJ Chen wrote:
Hi Andrzej,
During updatedb, reduce tasks (as seen in log) take most of the time.
Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually
started pretty soon after you start the map tasks, BUT they just sit
idle and wait for map tasks to finish. Whenever a map task finishes, a
process called "shuffling" occurs, i.e. records that fall into that
reduce task are shuffled from mapper output to reducer input. Still, the
reduce does not run until ALL map tasks are finished, to ensure that
all records have been shuffled. At that point the sorting begins, i.e.
all shuffled parts that ended up at a particular reducer are sorted by
key. And finally the last part begins, i.e. the reduce() operation
itself, which produces output directly, without any further
shuffling/sorting/post-processing.
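The order of the phases described above can be illustrated with a toy, single-process Python sketch. It is not how Hadoop implements them: the partition function, run_job and the in-memory lists are hypothetical stand-ins, and the real shuffle copies data over the network while map tasks are still finishing.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Stand-in for a hash partitioner; summing the key bytes keeps the toy
    # deterministic across runs.
    return sum(key.encode("utf-8")) % num_reducers


def run_job(map_outputs, num_reducers, reduce_fn):
    """map_outputs holds one list of (key, value) records per finished map task."""
    # Shuffle: each map task's records go to the reducer that owns the key.
    reducer_inputs = defaultdict(list)
    for records in map_outputs:
        for key, value in records:
            reducer_inputs[partition(key, num_reducers)].append((key, value))

    # Only once ALL map outputs are shuffled: sort by key, then call reduce().
    results = {}
    for r in range(num_reducers):
        shuffled = sorted(reducer_inputs[r], key=itemgetter(0))
        for key, group in groupby(shuffled, key=itemgetter(0)):
            results[key] = reduce_fn(key, [value for _, value in group])
    return results


if __name__ == "__main__":
    outputs = [[("a", 1), ("b", 1)], [("a", 1)]]
    print(run_job(outputs, 2, lambda key, values: sum(values)))
```

A reducer whose partition receives no keys simply gets an empty input list here, which is the toy analogue of a shuffle that moves no data.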
Each phase takes a different amount of time, and for different reasons.
Slow shuffling may indicate IO issues (disk, net), or an overload of the
source nodes (mappers). Slow sorting indicates poor disk IO performance
on the reduce nodes (or, again, too high a load on the reduce nodes).
Slow reducing is usually caused by the slowness of reduce() itself
(e.g. CPU-intensive operations, or IO contention when writing output).
Do you use HDFS? Do you use a network attached file store or local disks?
2010-08-17 09:31:54,653 INFO mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:54,653 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0
This indicates that there was no data in that particular shuffle (no
keys from that map task fall into this reducer).
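To see how many shuffles of this kind a reducer log contains, a short Python sketch can scan for the "header:" lines. It assumes the log format matches the two lines quoted above, and it treats a decompressed length of 2 bytes or less as "no data", which is inferred from this one example and not from any documented Hadoop constant. The function and constant names are hypothetical.

```python
import re

# Matches the "header:" line format seen in the quoted ReduceTask log.
HEADER_RE = re.compile(
    r"header: (attempt_\w+), compressed len: (\d+), decompressed len: (\d+)"
)

# Assumption taken from the quoted log: 2 decompressed bytes means no records.
EMPTY_DECOMPRESSED_LEN = 2


def empty_map_outputs(log_lines):
    """Return the map attempt IDs whose output for this reducer was empty."""
    empty = []
    for line in log_lines:
        match = HEADER_RE.search(line)
        if match and int(match.group(3)) <= EMPTY_DECOMPRESSED_LEN:
            empty.append(match.group(1))
    return empty


if __name__ == "__main__":
    sample = [
        "2010-08-17 09:31:54,653 INFO mapred.ReduceTask - header: "
        "attempt_201008141418_0023_m_000023_0, compressed len: 6, "
        "decompressed len: 2",
    ]
    print(empty_map_outputs(sample))
```

Many empty shuffles for a reducer are not a problem in themselves; they only show that the time is being spent elsewhere.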
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g