Re: performance for small cluster

Ken Krugler Wed, 18 Aug 2010 09:17:57 -0700

Hi AJ,

On Aug 18, 2010, at 7:26am, AJ Chen wrote:

Thanks for the explanation. I'm using HDFS. What config parameters may help
speed up shuffling, merging, sorting and IO? For example,
*fs.inmemory.size.mb=
io.file.buffer.size=4096 (default)
io.sort.factor=10 (default)
io.sort.mb=100 (default)


Two biggest wins are usually:

1. turn on map output compression (the default codec is fine, LZO would be better)

2. increase io.sort.mb as much as you can without triggering swap hell on your box (depends on total memory and on how much is allocated for DataNode/TaskTracker and the child JVMs for map/reduce tasks).
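
A rough way to check the second point before changing anything is to add up the memory a node would commit. The sketch below is only an illustration: the function names, the 50% rule of thumb for the sort buffer, and every number in the usage note are assumptions, not measured values. It relies on the fact that the io.sort.mb buffer is allocated inside each map task's child JVM heap (set through mapred.child.java.opts), so it has to fit there with room to spare.

```python
def node_memory_left_mb(total_mb, os_reserve_mb, daemon_heap_mb,
                        map_slots, reduce_slots, child_heap_mb):
    """Return the MB of RAM left after the OS reserve, the two Hadoop
    daemons (DataNode and TaskTracker), and one child JVM per task slot.
    A negative result means the node is likely to swap under full load."""
    daemons_mb = 2 * daemon_heap_mb
    tasks_mb = (map_slots + reduce_slots) * child_heap_mb
    return total_mb - os_reserve_mb - daemons_mb - tasks_mb


def sort_buffer_fits(io_sort_mb, child_heap_mb, max_fraction=0.5):
    """Check that io.sort.mb stays under a fraction of the child heap.
    The 0.5 default is an assumed rule of thumb, not a Hadoop constant."""
    return io_sort_mb <= child_heap_mb * max_fraction
```

For example, with a hypothetical 4096 MB node, 512 MB kept for the OS, 1000 MB per daemon, 2 map and 2 reduce slots, and a 512 MB child heap, node_memory_left_mb(4096, 512, 1000, 2, 2, 512) is negative, so either the slot count or the child heap would have to come down before raising io.sort.mb.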


-- Ken

On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]> wrote:
On 2010-08-18 00:03, AJ Chen wrote:
Hi Andrzej,
During updatedb, reduce tasks (as seen in log) take most of the time.
Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually
started pretty soon after you start the map tasks, BUT they just sit idle
and wait for map tasks to finish. Whenever a map task finishes, a process
called "shuffling" occurs, i.e. records that fall into that reduce task are
shuffled from mapper output to reducer input. Still, the reduce does not run
yet until ALL map tasks are finished, to ensure that all records have been
shuffled. At which point the sorting begins, i.e. all shuffled parts that
ended up at a particular reducer are sorted by key. And finally the last
part begins, i.e. the reduce() operation itself, which produces output
directly, without any further shuffling/sorting/post-processing.
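
The order of the phases described above can be sketched as a toy, single-process model. This is not Hadoop code: run_job, partition, and the CRC-based partitioning are hypothetical stand-ins chosen for illustration, and real shuffling copies data over the network rather than appending to lists.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick the reducer for a key with a stable hash (an assumption;
    the point is only that a given key always maps to the same reducer)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(map_outputs, num_reducers, reduce_fn):
    """map_outputs holds one list of (key, value) records per finished map task."""
    # Shuffle: as each map task finishes, its records are moved to the
    # input of the reducer that owns each key.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for records in map_outputs:
        for key, value in records:
            reducer_inputs[partition(key, num_reducers)].append((key, value))

    # Only after ALL map outputs have been shuffled does each reducer
    # sort its input by key and then call reduce() once per key.
    results = []
    for shuffled in reducer_inputs:
        shuffled.sort(key=itemgetter(0))
        output = {}
        for key, group in groupby(shuffled, key=itemgetter(0)):
            output[key] = reduce_fn(key, [value for _, value in group])
        results.append(output)
    return results
```

Calling run_job with a summing reduce_fn, for instance lambda key, values: sum(values), returns one dict per reducer. A reducer that received no keys returns an empty dict, which corresponds to the empty shuffle case in the log excerpt in this thread.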
Each phase takes a different amount of time, and for different reasons. Slow
shuffling may indicate IO issues (disk, net), or an overload of the source
nodes (mappers). Slow sorting indicates poor disk IO performance of reduce
nodes (or again, too high a load on the reduce nodes). Slow reducing is
usually caused by the slowness of reduce() itself (e.g. CPU-intensive
operations, or IO contention when writing output).
Do you use HDFS? Do you use a network attached file store or local disks?
2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0
This indicates that there was no data in that particular shuffle (no keys
from map task that fall into this reducer).
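
To see how many map outputs arrive empty at a reducer, the "header:" lines can be scanned with a small script. This is a minimal sketch that assumes each log entry sits on one line in the format quoted above; the function name empty_segments is made up for this example, and the threshold of 2 decompressed bytes is taken only from the entry shown here and may differ for other setups.

```python
import re

# Matches the ReduceTask "header:" entries in the format quoted in this thread.
HEADER_RE = re.compile(
    r"mapred\.ReduceTask - header:\s*(attempt_\w+),"
    r"\s*compressed len:\s*(\d+),\s*decompressed len:\s*(\d+)"
)


def empty_segments(lines, empty_len=2):
    """Return the map attempt IDs whose shuffled segment is at most
    empty_len decompressed bytes (assumed to mean 'no records')."""
    empty = []
    for line in lines:
        match = HEADER_RE.search(line)
        if match and int(match.group(3)) <= empty_len:
            empty.append(match.group(1))
    return empty
```

Feeding it the lines of a reduce task's log, for example via open(path) on a log file of your choosing, gives the list of map attempts that contributed nothing to that reducer.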


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
