Pankaj, are you using hcatalog?

On Fri, Jun 1, 2012 at 5:24 PM, Prashant Kommireddi <[email protected]> wrote:
> Right. And the documentation provides a list of operations that can be
> parallelized.
>
> On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> That being said, some operators such as "group all" and limit do
>> require using only 1 reducer, by nature. So it depends on what your
>> script is doing.
>>
>> On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote:
>>
>>> The automatic heuristic works the same in 0.9.1
>>> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might
>>> be better off setting it manually, looking at the job tracker counters.
>>>
>>> You should be fine using PARALLEL with any of the operators mentioned
>>> in the doc.
>>>
>>> -Prashant
>>>
>>> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
>>>> like a very useful upgrade. For the moment, though, it seems that I
>>>> should be able to use the 1 GB per reducer heuristic and specify the
>>>> number of reducers in Pig 0.9.1 by using the PARALLEL clause in the
>>>> Pig script. Does this sound right?
>>>>
>>>> Thanks,
>>>> Pankaj
>>>>
>>>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>>>
>>>>> Also, please note the default number of reducers is based on the
>>>>> input dataset. In the basic case, Pig will "automatically" spawn a
>>>>> reducer for each GB of input, so if your input dataset size is
>>>>> 500 GB you should see 500 reducers being spawned (though this is
>>>>> excessive in a lot of cases).
>>>>>
>>>>> This document talks about parallelism:
>>>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>>>
>>>>> Setting the right number of reducers (PARALLEL or set
>>>>> default_parallel) depends on what you are doing with it.
>>>>> If the reducer is CPU intensive (maybe a complex UDF running on the
>>>>> reducer side), you would probably spawn more reducers. Otherwise (in
>>>>> most cases), the suggestion in the doc (1 GB per reducer) holds good
>>>>> for regular aggregations (SUM, COUNT, ...).
>>>>>
>>>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>>>> 2. Re-run the job, setting default_parallel to 1 reducer per 1 GB of
>>>>>    reduce shuffle bytes, and see if it performs well.
>>>>> 3. If not, adjust it according to your reducer heap size. The more
>>>>>    heap, the less data is spilled to disk.
>>>>>
>>>>> There are a few more properties on the reduce side (buffer size,
>>>>> etc.), but those are probably not required to start with.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Prashant
>>>>>
>>>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]> wrote:
>>>>>
>>>>>> Pankaj,
>>>>>>
>>>>>> What version of pig are you using? In later versions of pig, it
>>>>>> should have some logic around automatically setting parallelism
>>>>>> (though sometimes these heuristics will be wrong).
>>>>>>
>>>>>> There are also some operations which will force you to use 1
>>>>>> reducer. It depends on what your script is doing.
>>>>>>
>>>>>> 2012/6/1 Pankaj Gupta <[email protected]>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just realized that one of my large-scale pig jobs that has 100K
>>>>>>> map tasks actually has only one reduce task. Reading the
>>>>>>> documentation, I see that the number of reduce tasks is defined by
>>>>>>> the PARALLEL clause, whose default value is 1. I have a few
>>>>>>> questions around this:
>>>>>>>
>>>>>>> # Why is the default value of reduce tasks 1?
>>>>>>> # (Related to the first question) Why aren't reduce tasks
>>>>>>>   parallelized automatically in Pig?
>>>>>>> # How do I choose a good value of reduce tasks for my pig jobs?
>>>>>>> Thanks in Advance,
>>>>>>> Pankaj

--
"...:::Aniket:::... Quetzalco@tl"
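For readers arriving at this thread: the PARALLEL clause discussed above attaches to a single reduce-side operator. A minimal sketch, with a hypothetical path, schema, and aliases invented for illustration:

```pig
-- Hypothetical input; path and schema are made up for this example.
logs = LOAD '/data/clicks' AS (user:chararray, bytes:long);

-- PARALLEL sets the reducer count for this GROUP only; without it,
-- Pig falls back to its input-size heuristic described in the thread.
grouped = GROUP logs BY user PARALLEL 20;
counts  = FOREACH grouped GENERATE group, COUNT(logs) AS n;

STORE counts INTO '/data/clicks_per_user';
```

Note that, per Dmitriy's point in the thread, operators such as GROUP ALL and LIMIT still run with a single reducer by nature, regardless of any PARALLEL setting.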

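Prashant's sizing steps can likewise be sketched as a script-wide setting. This assumes, purely for illustration, a job whose Reduce Shuffle Bytes counter on the JobTracker reads about 40 GB, so the 1 GB-per-reducer rule of thumb suggests roughly 40 reducers:

```pig
-- Applies to every reduce phase that lacks an explicit PARALLEL clause.
-- Illustrative sizing: ~40 GB reduce shuffle bytes / 1 GB per reducer ~= 40.
SET default_parallel 40;

-- Hypothetical job: the GROUP below now runs with 40 reducers.
logs   = LOAD '/data/clicks' AS (user:chararray, bytes:long);
byuser = GROUP logs BY user;
totals = FOREACH byuser GENERATE group, SUM(logs.bytes) AS total_bytes;
STORE totals INTO '/data/totals_by_user';
```

As the thread suggests, re-run with this value, check the counters again, and adjust for reducer heap size if data is spilling to disk.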