Thanks Ken for the tips.
-aj

On Wed, Aug 18, 2010 at 9:17 AM, Ken Krugler <[email protected]> wrote:
> Hi AJ,
>
> On Aug 18, 2010, at 7:26am, AJ Chen wrote:
>
>> Thanks for the explanation. I'm using HDFS. What config parameters may
>> help speed up shuffling, merging, sorting and IO? For example,
>> fs.inmemory.size.mb=
>> io.file.buffer.size=4096(default)
>> io.sort.factor=10(default)
>> io.sort.mb=100(default)
>
> Two biggest wins are usually:
>
> 1. turn on map output compression (defaultcodec is fine, LZO would be
>    better)
>
> 2. increase io.sort.mb as much as you can without triggering swap hell
>    on your box (depends on total memory and how much is allocated for
>    DataNode/TaskTracker and the child JVMs for map/reduce tasks).
>
> -- Ken
>
>> On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]> wrote:
>>
>>> On 2010-08-18 00:03, AJ Chen wrote:
>>>
>>>> Hi Andrzej,
>>>> During updatedb, reduce tasks (as seen in log) take most of the time.
>>>
>>> Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually
>>> started pretty soon after you start the map tasks, BUT they just sit
>>> idle and wait for map tasks to finish. Whenever a map task finishes, a
>>> process called "shuffling" occurs, i.e. records that fall into that
>>> reduce task are shuffled from mapper output to reducer input. Still,
>>> the reduce does not run until ALL map tasks are finished, to ensure
>>> that all records have been shuffled. At which point the sorting
>>> begins, i.e. all shuffled parts that ended up at a particular reducer
>>> are sorted by key. And finally the last part begins, i.e. the reduce()
>>> operation itself, which produces output directly, without any further
>>> shuffling/sorting/post-processing.
>>>
>>> Each phase takes a different amount of time, and for different
>>> reasons. Slow shuffling may indicate IO issues (disk, net), or an
>>> overload of the source nodes (mappers). Slow sorting indicates poor
>>> disk IO performance of the reduce nodes (or again, too high a load on
>>> the reduce nodes). Slow reducing is usually caused by the slowness of
>>> reduce() itself (e.g. CPU-intensive operations, or IO contention when
>>> writing output).
>>>
>>> Do you use HDFS? Do you use a network attached file store or local
>>> disks?
>>>
>>>> 2010-08-17 09:31:54,653 INFO mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
>>>> 2010-08-17 09:31:54,653 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0
>>>
>>> This indicates that there was no data in that particular shuffle (no
>>> keys from that map task fall into this reducer).
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>> ___. ___ ___ ___ _ _ __________________________________
>>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>>> ___|||__|| \| || | Embedded Unix, System Integration
>>> http://www.sigram.com Contact: info at sigram dot com
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
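
For readers of the archive, the phase order in Andrzej's primer (map, shuffle, wait for all maps, sort, reduce) can be sketched as a toy in-memory model. This is only an illustration under stated assumptions, not Hadoop's or Nutch's actual code: `partition`, `run_job`, `count_urls` and `total` are hypothetical names, and a CRC32 hash stands in for whatever partitioner the real job uses.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    """Pick the reduce task that owns a key (hypothetical hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, map_fn, reduce_fn, num_reducers):
    """Toy model of the phases: map -> shuffle -> sort -> reduce."""
    # One bucket per reduce task; stands in for the reducer's shuffled input.
    reducer_inputs = [[] for _ in range(num_reducers)]

    # Map + shuffle: when a "map task" finishes, its records are copied to
    # the reducer that owns each key. A reducer may receive nothing from a
    # given map task, as in the "Shuffling 2 bytes" log lines.
    for split in splits:
        map_output = list(map_fn(split))
        for key, value in map_output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))

    # Only after ALL map tasks are done does each reducer sort its input
    # by key and then call reduce() once per key.
    results = {}
    for records in reducer_inputs:
        records.sort(key=lambda kv: kv[0])
        for key, group in groupby(records, key=lambda kv: kv[0]):
            results[key] = reduce_fn(key, [value for _, value in group])
    return results


def count_urls(split):
    """Example map function: emit (url, 1) for every URL in the split."""
    for url in split:
        yield url, 1


def total(key, values):
    """Example reduce function: sum the counts for one key."""
    return sum(values)


if __name__ == "__main__":
    splits = [["a.com", "b.com"], ["a.com"]]
    print(run_job(splits, count_urls, total, num_reducers=3))
```

In this model the three costs Andrzej distinguishes map onto three separate loops: copying records into `reducer_inputs` (shuffle), `records.sort` (sort), and the `reduce_fn` calls (reduce), so timing each loop separately mirrors how one would read the per-phase times in the task logs.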

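
Ken's two tips can also be written down as configuration. The sketch below renders them as `<property>` entries of the kind found in a Hadoop site file. The property names are the ones I believe Hadoop 0.20-era releases used (`mapred.compress.map.output`, `mapred.map.output.compression.codec`, `io.sort.mb`); please check them against the `mapred-default.xml` shipped with your version. The `io.sort.mb` value of 200 is purely illustrative and is not a recommendation from the thread, and `render_site_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed 0.20-era property names; verify against your Hadoop version.
TUNING = {
    # Tip 1: compress map output (DefaultCodec per Ken; LZO would be better).
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.DefaultCodec",
    # Tip 2: larger sort buffer. Illustrative value only; it has to fit
    # inside the child JVM heap without pushing the box into swap.
    "io.sort.mb": "200",
}


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> block."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(TUNING))
```

The sort buffer is allocated inside each map task's JVM, so any increase to `io.sort.mb` probably needs a matching look at the child JVM heap setting and at how many task slots run per node.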
