Re: Pig job is taking more time than Java M/R

praveenesh kumar Mon, 16 Jan 2012 01:09:39 -0800

Using PARALLEL I can lower the total number of reducers, and that brings the
execution time down.
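
A minimal sketch of what that change could look like, reusing the script
quoted further down in this thread. The reducer count of 40 is an assumption,
picked only because it matches the Java M/R job mentioned below; it is not a
value confirmed anywhere in the thread.

```pig
-- Same script as in the original question, with an explicit reducer count.
-- The value 40 is an assumption (it mirrors the Java M/R job); tune it per cluster.
Data = LOAD 'Data.csv' USING PigStorage(',');
IDs = FOREACH Data GENERATE $0;
-- DISTINCT triggers a reduce phase, so PARALLEL applies to this statement.
UniqueID = DISTINCT IDs PARALLEL 40;
DUMP UniqueID;
```

Alternatively, putting SET default_parallel 40; at the top of the script should
apply the same count to every reduce-side operator that has no PARALLEL clause
of its own.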


Thanks,
Praveenesh

On Mon, Jan 16, 2012 at 12:43 PM, Prashant Kommireddi
<[email protected]> wrote:

> That explains it. Pig estimates the number of reducers with a heuristic
> based on input dataset size (1 reducer per GB). You should set PARALLEL
> explicitly whenever the amount of data reaching the reducers is small.
>
> Please take a look at the PARALLEL syntax:
> http://pig.apache.org/docs/r0.9.1/basic.html#distinct
>
> Thanks,
> Prashant
>
> On Sun, Jan 15, 2012 at 10:59 PM, praveenesh kumar <[email protected]> wrote:
>
> > I am using Apache Pig version 0.11.0-SNAPSHOT (r1225753), built from trunk,
> > and Hadoop 0.20.205.
> > Nothing else was running on the cluster at that time, and there was no
> > waiting for map-reduce slots.
> > The only difference I saw was that my Java M/R job ran only 40 reducers,
> > whereas my Pig job ran 457 reducers. I guess the slowdown may be because
> > of so many reducers running.
> > Can I control the number of reducers?
> >
> > Thanks,
> > Praveenesh
> >
> >
> > On Mon, Jan 16, 2012 at 11:42 AM, Prashant Kommireddi
> > <[email protected]> wrote:
> >
> > > Hi Praveenesh,
> > >
> > > You can use 'EXPLAIN' to understand what Pig is doing under the hood
> > > (MR plan):
> > > http://pig.apache.org/docs/r0.9.1/test.html#explain
> > >
> > > What version of Pig and Hadoop are you using? I have never seen such a
> > > huge difference between Java MR and Pig. At the time you ran Pig, was the
> > > cluster idle or did you have other jobs running at the same time? Did you
> > > make sure the job was not waiting on Map or Reduce slots being made
> > > available?
> > >
> > > Thanks,
> > > Prashant
> > >
> > > On Sun, Jan 15, 2012 at 9:47 PM, praveenesh kumar <[email protected]> wrote:
> > >
> > > > Hey Guys,
> > > >
> > > > Is there any way I can see the M/R jobs that Pig runs
> > > > internally for a given Pig script?
> > > > I want to get the unique values of a particular column.
> > > >
> > > > For that I wrote the following script:
> > > >
> > > > Data = Load 'Data.csv' using PigStorage(',');
> > > > IDs = FOREACH Data GENERATE $0;
> > > > UniqueID = Distinct IDs;
> > > > Dump UniqueID;
> > > >
> > > > Is this the right/best way to get the unique values of a particular
> > > > column?
> > > >
> > > > The reason I am asking is that I ran the above script on my cluster
> > > > and it took around 30 minutes to finish.
> > > > However, when I wrote traditional Java M/R code for the same thing,
> > > > it took only 10 minutes.
> > > >
> > > > So I want to see what Pig is doing internally.
> > > > Can anyone tell me what could be the reason for such behaviour? How
> > > > can I decrease the Pig execution time?
> > > >
> > > > Thanks,
> > > > Praveenesh
> > > >
> > >
> >
>
