That being said, some operators, such as GROUP ALL and LIMIT, inherently require a single reducer, so it depends on what your script is doing.
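To make that concrete, here is a minimal sketch (the relation and field names are made up for illustration). A regular GROUP BY honors PARALLEL, while GROUP ALL runs on a single reducer no matter what you ask for:

    -- Hypothetical input; adjust the path and schema to your data.
    logs    = LOAD 'input/logs' AS (user:chararray, bytes:long);

    -- GROUP BY honors PARALLEL: this reduce stage runs with 50 reducers.
    by_user = GROUP logs BY user PARALLEL 50;
    totals  = FOREACH by_user GENERATE group, SUM(logs.bytes) AS total;

    -- GROUP ALL must bring every record to one place, so it runs on a
    -- single reducer regardless of PARALLEL; the final stage of LIMIT
    -- behaves the same way.
    everything  = GROUP logs ALL;
    grand_total = FOREACH everything GENERATE SUM(logs.bytes) AS total;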
On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote:

> The automatic heuristic works the same in 0.9.1
> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be
> better off setting it manually based on the JobTracker counters.
>
> You should be fine using PARALLEL with any of the operators mentioned in
> the doc.
>
> -Prashant
>
>
> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:
>
>> Hi Prashant,
>>
>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
>> like a very useful upgrade. For the moment, though, it seems I should be
>> able to use the 1 GB-per-reducer heuristic and specify the number of
>> reducers in Pig 0.9.1 by using the PARALLEL clause in the Pig script.
>> Does this sound right?
>>
>> Thanks,
>> Pankaj
>>
>>
>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>
>>> Also, please note that the default number of reducers is based on the
>>> input dataset. In the basic case, Pig will "automatically" spawn a
>>> reducer for each GB of input, so if your input dataset is 500 GB you
>>> should see 500 reducers being spawned (though this is excessive in a
>>> lot of cases).
>>>
>>> This document talks about parallelism:
>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>
>>> Setting the right number of reducers (PARALLEL or set default_parallel)
>>> depends on what you are doing. If the reducer is CPU intensive (maybe a
>>> complex UDF running on the reduce side), you would probably spawn more
>>> reducers. Otherwise (in most cases), the suggestion in the doc (1 GB
>>> per reducer) holds good for regular aggregations (SUM, COUNT, ...).
>>>
>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>> 2. Re-run the job with default_parallel set to 1 reducer per 1 GB of
>>> reduce shuffle bytes and see if it performs well.
>>> 3. If not, adjust it according to your reducer heap size. The more
>>> heap, the less data is spilled to disk.
>>>
>>> There are a few more properties on the reduce side (buffer size, etc.),
>>> but those probably aren't required to start with.
>>>
>>> Thanks,
>>>
>>> Prashant
>>>
>>>
>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]>
>>> wrote:
>>>
>>>> Pankaj,
>>>>
>>>> What version of Pig are you using? Later versions of Pig have some
>>>> logic for automatically setting parallelism (though sometimes these
>>>> heuristics will be wrong).
>>>>
>>>> There are also some operations which will force you to use 1 reducer.
>>>> It depends on what your script is doing.
>>>>
>>>> 2012/6/1 Pankaj Gupta <[email protected]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I just realized that one of my large-scale Pig jobs, which has 100K
>>>>> map tasks, actually has only one reduce task. Reading the
>>>>> documentation, I see that the number of reduce tasks is defined by
>>>>> the PARALLEL clause, whose default value is 1. I have a few questions
>>>>> around this:
>>>>>
>>>>> # Why is the default value of reduce tasks 1?
>>>>> # (Related to the first question) Why aren't reduce tasks
>>>>> parallelized automatically in Pig?
>>>>> # How do I choose a good number of reduce tasks for my Pig jobs?
>>>>>
>>>>> Thanks in advance,
>>>>> Pankaj
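For completeness, a minimal sketch of applying the 1 GB-per-reducer heuristic Prashant describes above. The figure of 20 is hypothetical; read your own job's Reduce Shuffle Bytes counter off the JobTracker first:

    -- Suppose the JobTracker reports ~20 GB of reduce shuffle bytes for
    -- this job; the rule of thumb then suggests roughly 20 reducers.
    SET default_parallel 20;

    logs    = LOAD 'input/logs' AS (user:chararray, bytes:long);

    -- No PARALLEL clause needed: reduce-side operators now default to 20
    -- reducers, and a per-statement PARALLEL would still override this.
    by_user = GROUP logs BY user;
    totals  = FOREACH by_user GENERATE group, SUM(logs.bytes) AS total;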
