My bad, I did not realize you had pasted the counters in your previous message.

Please try playing with the reduce-side properties I mentioned in my
previous reply (mapred.inmem... and mapred.job.reduce...).
The combiner is doing a good amount of work, so please do not turn it off.
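
As a rough sketch, the two reduce-side properties from my previous reply
could be set at the top of the Pig script like this. The values are only
the starting points suggested there (0, and 0.70 to 0.80), not something
measured on your cluster:

```pig
-- Starting points from the earlier reply; tune against your own counters.
-- 0 removes the count-based trigger for the in-memory merge on the reduce side.
set mapred.inmem.merge.threshold 0;
-- Share of reducer heap allowed to keep map outputs in memory during reduce.
set mapred.job.reduce.input.buffer.percent 0.70;
```

Change one value at a time and rerun, so you can tell which setting moved
the shuffle time.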

Also, the reducers are not doing much aggregation: the number of reduce
input records is approximately equal to the number of reduce input groups.
That means the reduce phase is more I/O-intensive than memory-intensive
(assuming aggregation is the goal of the reducer and there is no heavy
processing logic in it). There is not much you can tune there.

It might be useful to play with the io.sort.* properties on the map side to
reduce map-side spills.
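
A minimal sketch of what that could look like in the Pig script. The
numbers are assumptions for illustration only; io.sort.mb has to fit
inside the task heap set through mapred.child.java.opts, so check that
before raising it:

```pig
-- Illustrative values only, not a recommendation for this cluster.
-- io.sort.mb is the map-side sort buffer in MB; it must fit in the task heap.
set io.sort.mb 200;
-- io.sort.factor is the number of streams merged at once while sorting files.
set io.sort.factor 50;
```

Comparing the Spilled Records counter with Map output records before and
after the change should show whether the spills actually went down.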

Thanks,
Prashant

On Wed, Mar 14, 2012 at 2:01 PM, Prashant Kommireddi <[email protected]> wrote:

> Can you also provide numbers for Reduce Shuffle Bytes? And Combine Input
> and Output records? How many map slots do you have on the cluster? How many
> spills on the Map and Reduce side?
>
> If you are confident about shuffle time being the bottleneck, you could
> try tuning a couple of parameters
>
>    1. mapred.inmem.merge.threshold - # of map output to be merged at once
>    on reduce side. Set it to 0 so it depends on
>    mapred.job.reduce.input.buffer.percent
>    2. mapred.job.reduce.input.buffer.percent - you mentioned reduce is
>    not memory intensive, in which case you can try increasing this to 0.70 or
>    0.80
>    3. Make sure the combiners are doing work (aggregation). If not, you
>    could shut off combiner.
>
> You can also play with io.sort.mb and io.sort.factor which really depends
> on how much memory you have allocated each task (mapred.child.java.opts).
> Tuning depends on a lot of factors, you might have to dig deeper into the
> counters.
>
> Thanks,
>
> Prashant
>
> On Tue, Mar 13, 2012 at 5:24 AM, Austin Chungath <[email protected]> wrote:
>
>> Hi,
>> I am running a pig query on around 500 GB input data.
>> The current block size is 128 MB and split size is the default 128 MB.
>> I have also specified 16 reducers and around 3800 mappers are running.
>>
>> Now I observe that shuffling is taking a long time to complete execution,
>> approximately 25 mins per job.
>>
>> Can anyone suggest how I can bring down the shuffling time? Is there any
>> property that I can tweak to improve performance?
>>
>> Thanks & Regards,
>> Austin
>>
>
>
