Thanks, Ken, for the tips.  -aj

On Wed, Aug 18, 2010 at 9:17 AM, Ken Krugler <[email protected]> wrote:

> Hi AJ,
>
>
> On Aug 18, 2010, at 7:26am, AJ Chen wrote:
>
>> Thanks for the explanation. I'm using HDFS. What config parameters may
>> help speed up shuffling, merging, sorting, and IO? For example:
>> fs.inmemory.size.mb=
>> io.file.buffer.size=4096 (default)
>> io.sort.factor=10 (default)
>> io.sort.mb=100 (default)
>>
>
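For reference, the parameters quoted above are ordinary Hadoop Configuration keys read from the *-site.xml files on the classpath, so their effective values can be checked in a few lines of Java. A minimal sketch against the 0.20-era API (the class name ShowSortConfig and the fallback defaults are only illustrative, matching the defaults quoted above):

  import org.apache.hadoop.conf.Configuration;

  public class ShowSortConfig {
      public static void main(String[] args) {
          // Loads core-site.xml / mapred-site.xml (and the bundled defaults)
          // from the classpath.
          Configuration conf = new Configuration();
          // The second argument is only the fallback if the key is unset.
          System.out.println("io.file.buffer.size = " + conf.getInt("io.file.buffer.size", 4096));
          System.out.println("io.sort.factor      = " + conf.getInt("io.sort.factor", 10));
          System.out.println("io.sort.mb          = " + conf.getInt("io.sort.mb", 100));
      }
  }
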
> The two biggest wins are usually:
>
> 1. Turn on map output compression (DefaultCodec is fine; LZO would be
> better).
>
> 2. Increase io.sort.mb as much as you can without triggering swap hell on
> your box (this depends on total memory and on how much is allocated to the
> DataNode/TaskTracker daemons and the child JVMs for map/reduce tasks).
>
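A minimal sketch of both suggestions, assuming the 0.20-era JobConf API and property names; the class name ShuffleTuning and the 200 MB figure are only illustrative, and io.sort.mb should be sized against the child JVM heap:

  import org.apache.hadoop.io.compress.DefaultCodec;
  import org.apache.hadoop.mapred.JobConf;

  public class ShuffleTuning {
      public static JobConf tune(JobConf job) {
          // 1. Compress map output: DefaultCodec works out of the box;
          //    LZO is faster if the native libraries are installed.
          job.setCompressMapOutput(true);
          job.setMapOutputCompressorClass(DefaultCodec.class);

          // 2. Enlarge the in-memory sort buffer (value in MB). 200 is only an
          //    example -- leave enough heap that the node does not start swapping.
          job.setInt("io.sort.mb", 200);
          return job;
      }
  }

The same properties can also be set cluster-wide in mapred-site.xml, or per job with -D on the command line when the driver goes through ToolRunner.
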
> -- Ken
>
>
>  On Wed, Aug 18, 2010 at 1:30 AM, Andrzej Bialecki <[email protected]> wrote:
>>
>>  On 2010-08-18 00:03, AJ Chen wrote:
>>>
>>>  Hi Andrzej,
>>>> During updatedb, reduce tasks (as seen in the log) take most of the time.
>>>>
>>>>
>>> Erhm.. ok, a short primer on reduce tasks ;) Reduce tasks are usually
>>> started pretty soon after you start the map tasks, BUT they just sit
>>> idle and wait for the map tasks to finish. Whenever a map task finishes,
>>> a process called "shuffling" occurs, i.e. the records that fall into a
>>> given reduce task are shuffled from the mapper output to the reducer
>>> input. Still, the reduce does not run until ALL map tasks are finished,
>>> to ensure that all records have been shuffled. At that point the sorting
>>> begins, i.e. all the shuffled parts that ended up at a particular
>>> reducer are sorted by key. And finally the last part begins, i.e. the
>>> reduce() operation itself, which produces output directly, without any
>>> further shuffling/sorting/post-processing.
>>>
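To connect these phases to user code: shuffle and sort happen entirely inside the framework, and the only user code a reduce task runs is reduce() itself, called once per key after the merge completes. A minimal sketch with the 0.20 "new" mapreduce API (the CountingReducer class is just an illustration, not anything from Nutch):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class CountingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
              throws IOException, InterruptedException {
          // By the time this runs, all map output for this key has already been
          // shuffled to this task and merge-sorted on its local disks.
          long sum = 0;
          for (LongWritable v : values) {
              sum += v.get();
          }
          // Output goes straight to the job's output files; no further
          // shuffling or sorting happens after reduce().
          context.write(key, new LongWritable(sum));
      }
  }
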
>>> Each phase takes a different amount of time, and for different reasons.
>>> Slow shuffling may indicate IO issues (disk, net) or an overload of the
>>> source (mapper) nodes. Slow sorting indicates poor disk IO performance
>>> on the reduce nodes (or, again, too high a load on the reduce nodes).
>>> Slow reducing is usually caused by the slowness of reduce() itself
>>> (e.g. CPU-intensive operations, or IO contention when writing output).
>>>
>>> Do you use HDFS? Do you use a network-attached file store or local disks?
>>>
>>>
>>>
>>>
>>>> 2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - header:
>>>> attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
>>>> 2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - Shuffling 2 bytes (6 raw
>>>> bytes) into RAM from attempt_201008141418_0023_m_000023_0
>>>>
>>> This indicates that there was no data in that particular shuffle (no keys
>>> from that map task fall into this reducer).
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>> ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
