That being said, some operators, such as GROUP ALL and LIMIT, inherently require a single reducer, so it depends on what your script is doing.
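To make that concrete, here is a minimal sketch (the relation and field names are made up for illustration). A regular GROUP BY honors PARALLEL, while GROUP ALL runs on a single reducer no matter what you ask for:

    -- Hypothetical input; adjust the path and schema to your data.
    logs    = LOAD 'input/logs' AS (user:chararray, bytes:long);

    -- GROUP BY honors PARALLEL: this reduce stage runs with 50 reducers.
    by_user = GROUP logs BY user PARALLEL 50;
    totals  = FOREACH by_user GENERATE group, SUM(logs.bytes) AS total;

    -- GROUP ALL must bring every record to one place, so it runs on a
    -- single reducer regardless of PARALLEL; the final stage of LIMIT
    -- behaves the same way.
    everything  = GROUP logs ALL;
    grand_total = FOREACH everything GENERATE SUM(logs.bytes) AS total;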
On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote:

> The automatic heuristic works the same in 0.9.1
> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be
> better off setting it manually based on the JobTracker counters.
>
> You should be fine using PARALLEL with any of the operators mentioned in
> the doc.
>
> -Prashant
>
>
> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:
>
>> Hi Prashant,
>>
>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
>> like a very useful upgrade. For the moment, though, it seems I should be
>> able to use the 1 GB-per-reducer heuristic and specify the number of
>> reducers in Pig 0.9.1 by using the PARALLEL clause in the Pig script.
>> Does this sound right?
>>
>> Thanks,
>> Pankaj
>>
>>
>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>
>>> Also, please note that the default number of reducers is based on the
>>> input dataset. In the basic case, Pig will "automatically" spawn a
>>> reducer for each GB of input, so if your input dataset is 500 GB you
>>> should see 500 reducers being spawned (though this is excessive in a
>>> lot of cases).
>>>
>>> This document talks about parallelism:
>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>
>>> Setting the right number of reducers (PARALLEL or set default_parallel)
>>> depends on what you are doing. If the reducer is CPU intensive (maybe a
>>> complex UDF running on the reduce side), you would probably spawn more
>>> reducers. Otherwise (in most cases), the suggestion in the doc (1 GB
>>> per reducer) holds good for regular aggregations (SUM, COUNT, ...).
>>>
>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>> 2. Re-run the job with default_parallel set to 1 reducer per 1 GB of
>>> reduce shuffle bytes and see if it performs well.
>>> 3. If not, adjust it according to your reducer heap size. The more
>>> heap, the less data is spilled to disk.
>>>
>>> There are a few more properties on the reduce side (buffer size, etc.),
>>> but those probably aren't required to start with.
>>>
>>> Thanks,
>>>
>>> Prashant
>>>
>>>
>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]>
>>> wrote:
>>>
>>>> Pankaj,
>>>>
>>>> What version of Pig are you using? Later versions of Pig have some
>>>> logic for automatically setting parallelism (though sometimes these
>>>> heuristics will be wrong).
>>>>
>>>> There are also some operations which will force you to use 1 reducer.
>>>> It depends on what your script is doing.
>>>>
>>>> 2012/6/1 Pankaj Gupta <[email protected]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I just realized that one of my large-scale Pig jobs, which has 100K
>>>>> map tasks, actually has only one reduce task. Reading the
>>>>> documentation, I see that the number of reduce tasks is defined by
>>>>> the PARALLEL clause, whose default value is 1. I have a few questions
>>>>> around this:
>>>>>
>>>>> # Why is the default value of reduce tasks 1?
>>>>> # (Related to the first question) Why aren't reduce tasks
>>>>> parallelized automatically in Pig?
>>>>> # How do I choose a good number of reduce tasks for my Pig jobs?
>>>>>
>>>>> Thanks in advance,
>>>>> Pankaj
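For completeness, a minimal sketch of applying the 1 GB-per-reducer heuristic Prashant describes above. The figure of 20 is hypothetical; read your own job's Reduce Shuffle Bytes counter off the JobTracker first:

    -- Suppose the JobTracker reports ~20 GB of reduce shuffle bytes for
    -- this job; the rule of thumb then suggests roughly 20 reducers.
    SET default_parallel 20;

    logs    = LOAD 'input/logs' AS (user:chararray, bytes:long);

    -- No PARALLEL clause needed: reduce-side operators now default to 20
    -- reducers, and a per-statement PARALLEL would still override this.
    by_user = GROUP logs BY user;
    totals  = FOREACH by_user GENERATE group, SUM(logs.bytes) AS total;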
