Right. And the documentation provides a list of operations that can be parallelized.
On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <[email protected]> wrote: > That being said, some operators such as "group all" and limit, do require > using only 1 reducer, by nature. So it depends on what your script is doing. > > On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote: > >> Automatic Heuristic works the same in 0.9.1 >> http://pig.apache.org/docs/r0.9.1/perf.html#parallel, but you might be >> better off setting it manually looking at job tracker counters. >> >> You should be fine with using PARALLEL for any of the operators mentioned >> on the doc. >> >> -Prashant >> >> >> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote: >> >>> Hi Prashant, >>> >>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but seems like a >>> very useful upgrade. For the moment though it seems that I should be able >>> to use the 1GB per reducer heuristic and specify the number of reducers in >>> Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does this sound >>> right? >>> >>> Thanks, >>> Pankaj >>> >>> >>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote: >>> >>>> Also, please note default number of reducers are based on input dataset. >>> In >>>> the basic case, Pig will "automatically" spawn a reducer for each GB of >>>> input, so if your input dataset size is 500 GB you should see 500 >>> reducers >>>> being spawned (though this is excessive in a lot of cases). >>>> >>>> This document talks about parallelism >>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel >>>> >>>> Setting the right number of reducers (PARALLEL or set default_parallel) >>>> depends on what you are doing with it. If the reducer is CPU intensive >>> (may >>>> be a complex UDF running on reducer side), you would probably spawn more >>>> reducers. Otherwise (in most cases), the suggestion in the doc (1 GB per >>>> reducer) holds good for regular aggregations (SUM, COUNT..). >>>> >>>> >>>> 1. Take a look at Reduce Shuffle Bytes for the job on JobTracker >>>> 2. Re-run the job by setting default_parallel to -> 1 reducer per 1 GB >>>> of reduce shuffle bytes and see if it performs well >>>> 3. If not, adjust it according to your Reducer heap size. More the >>> heap, >>>> less is the data spilled to disk. >>>> >>>> There are a few more properties on the Reduce side (buffer size etc) but >>>> that probably is not required to start with. >>>> >>>> Thanks, >>>> >>>> Prashant >>>> >>>> >>>> >>>> >>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected] >>>> wrote: >>>> >>>>> Pankaj, >>>>> >>>>> What version of pig are you using? In later versions of pig, it should >>> have >>>>> some logic around automatically setting parallelisms (though sometimes >>>>> these heuristics will be wrong). >>>>> >>>>> There are also some operations which will force you to use 1 reducer. It >>>>> depends on what your script is doing. >>>>> >>>>> 2012/6/1 Pankaj Gupta <[email protected]> >>>>> >>>>>> Hi, >>>>>> >>>>>> I just realized that one of my large scale pig jobs that has 100K map >>>>> jobs >>>>>> actually only has one reduce task. Reading the documentation I see that >>>>> the >>>>>> number of reduce tasks is defined by the PARALLEL clause whose default >>>>>> value is 1. I have a few questions around this: >>>>>> >>>>>> # Why is the default value of reduce tasks 1? >>>>>> # (Related to first question) Why aren't reduce tasks parallelized >>>>>> automatically in Pig? >>>>>> # How do I choose a good value of reduce tasks for my pig jobs? >>>>>> >>>>>> Thanks in Advance, >>>>>> Pankaj >>>>> >>> >>>
