Pankaj, are you using hcatalog?

On Fri, Jun 1, 2012 at 5:24 PM, Prashant Kommireddi <[email protected]> wrote:
> Right. And the documentation provides a list of operations that can be
> parallelized.
>
> On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> That being said, some operators such as "group all" and limit do
>> require using only 1 reducer, by nature. So it depends on what your
>> script is doing.
>>
>> On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote:
>>
>>> The automatic heuristic works the same in 0.9.1
>>> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might
>>> be better off setting it manually, looking at the job tracker counters.
>>>
>>> You should be fine using PARALLEL with any of the operators mentioned
>>> in the doc.
>>>
>>> -Prashant
>>>
>>> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
>>>> like a very useful upgrade. For the moment, though, it seems that I
>>>> should be able to use the 1 GB per reducer heuristic and specify the
>>>> number of reducers in Pig 0.9.1 by using the PARALLEL clause in the
>>>> Pig script. Does this sound right?
>>>>
>>>> Thanks,
>>>> Pankaj
>>>>
>>>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>>>
>>>>> Also, please note the default number of reducers is based on the
>>>>> input dataset. In the basic case, Pig will "automatically" spawn a
>>>>> reducer for each GB of input, so if your input dataset size is
>>>>> 500 GB you should see 500 reducers being spawned (though this is
>>>>> excessive in a lot of cases).
>>>>>
>>>>> This document talks about parallelism:
>>>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>>>
>>>>> Setting the right number of reducers (PARALLEL or set
>>>>> default_parallel) depends on what you are doing with it.
>>>>> If the reducer is CPU intensive (maybe a complex UDF running on the
>>>>> reducer side), you would probably spawn more reducers. Otherwise (in
>>>>> most cases), the suggestion in the doc (1 GB per reducer) holds good
>>>>> for regular aggregations (SUM, COUNT, ...).
>>>>>
>>>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>>>> 2. Re-run the job, setting default_parallel to 1 reducer per 1 GB of
>>>>>    reduce shuffle bytes, and see if it performs well.
>>>>> 3. If not, adjust it according to your reducer heap size. The more
>>>>>    heap, the less data is spilled to disk.
>>>>>
>>>>> There are a few more properties on the reduce side (buffer size,
>>>>> etc.), but those are probably not required to start with.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Prashant
>>>>>
>>>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]> wrote:
>>>>>
>>>>>> Pankaj,
>>>>>>
>>>>>> What version of pig are you using? In later versions of pig, it
>>>>>> should have some logic around automatically setting parallelism
>>>>>> (though sometimes these heuristics will be wrong).
>>>>>>
>>>>>> There are also some operations which will force you to use 1
>>>>>> reducer. It depends on what your script is doing.
>>>>>>
>>>>>> 2012/6/1 Pankaj Gupta <[email protected]>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just realized that one of my large-scale pig jobs that has 100K
>>>>>>> map tasks actually has only one reduce task. Reading the
>>>>>>> documentation, I see that the number of reduce tasks is defined by
>>>>>>> the PARALLEL clause, whose default value is 1. I have a few
>>>>>>> questions around this:
>>>>>>>
>>>>>>> # Why is the default value of reduce tasks 1?
>>>>>>> # (Related to the first question) Why aren't reduce tasks
>>>>>>>   parallelized automatically in Pig?
>>>>>>> # How do I choose a good value of reduce tasks for my pig jobs?
>>>>>>> Thanks in Advance,
>>>>>>> Pankaj

--
"...:::Aniket:::... Quetzalco@tl"
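For readers arriving at this thread: the PARALLEL clause discussed above attaches to a single reduce-side operator. A minimal sketch, with a hypothetical path, schema, and aliases invented for illustration:

```pig
-- Hypothetical input; path and schema are made up for this example.
logs = LOAD '/data/clicks' AS (user:chararray, bytes:long);

-- PARALLEL sets the reducer count for this GROUP only; without it,
-- Pig falls back to its input-size heuristic described in the thread.
grouped = GROUP logs BY user PARALLEL 20;
counts  = FOREACH grouped GENERATE group, COUNT(logs) AS n;

STORE counts INTO '/data/clicks_per_user';
```

Note that, per Dmitriy's point in the thread, operators such as GROUP ALL and LIMIT still run with a single reducer by nature, regardless of any PARALLEL setting.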

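Prashant's sizing steps can likewise be sketched as a script-wide setting. This assumes, purely for illustration, a job whose Reduce Shuffle Bytes counter on the JobTracker reads about 40 GB, so the 1 GB-per-reducer rule of thumb suggests roughly 40 reducers:

```pig
-- Applies to every reduce phase that lacks an explicit PARALLEL clause.
-- Illustrative sizing: ~40 GB reduce shuffle bytes / 1 GB per reducer ~= 40.
SET default_parallel 40;

-- Hypothetical job: the GROUP below now runs with 40 reducers.
logs   = LOAD '/data/clicks' AS (user:chararray, bytes:long);
byuser = GROUP logs BY user;
totals = FOREACH byuser GENERATE group, SUM(logs.bytes) AS total_bytes;
STORE totals INTO '/data/totals_by_user';
```

As the thread suggests, re-run with this value, check the counters again, and adjust for reducer heap size if data is spilling to disk.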