Using PARALLEL, I can reduce the total number of reducers and, with it, the execution time.
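
For the archives, here is roughly how the script from my first mail looks with PARALLEL added. The value 40 is only an assumption, borrowed from the reducer count of my Java M/R job; it is not a tuned number, and the right value will depend on the data and the cluster.

```pig
-- Sketch only: the same script as in the original question, with PARALLEL added.
-- The value 40 is an assumption taken from the Java M/R job's reducer count.
Data = LOAD 'Data.csv' USING PigStorage(',');
IDs = FOREACH Data GENERATE $0;
-- PARALLEL sets the number of reducers for this reduce-side operator only.
UniqueID = DISTINCT IDs PARALLEL 40;
DUMP UniqueID;
```

As far as I understand, putting `SET default_parallel 40;` at the top of the script would apply the same reducer count to every reduce-side operator instead of just this one, and `EXPLAIN UniqueID;` prints the M/R plan that Prashant mentioned.
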
Thanks,
Praveenesh

On Mon, Jan 16, 2012 at 12:43 PM, Prashant Kommireddi <[email protected]> wrote:

> That explains it. Pig computes the number of reducers using a heuristic
> based on input dataset size (1 reducer per GB). You would always want to
> use PARALLEL if the data being forwarded to reducers is not a lot.
>
> Please take a look at (PARALLEL syntax):
> http://pig.apache.org/docs/r0.9.1/basic.html#distinct
>
> Thanks,
> Prashant
>
> On Sun, Jan 15, 2012 at 10:59 PM, praveenesh kumar <[email protected]> wrote:
>
> > I am using Apache Pig version 0.11.0-SNAPSHOT (r1225753), built from
> > trunk, and Hadoop 0.20.205.
> > Nothing else was running on the cluster at that time, and there was no
> > waiting for map-reduce slots.
> > The only difference I saw was that my Java M/R job ran only 40 reducers,
> > whereas my Pig job ran 457 reducers. I guess the slowdown may be because
> > of so many reducers running.
> > Can I control the number of reducers?
> >
> > Thanks,
> > Praveenesh
> >
> > On Mon, Jan 16, 2012 at 11:42 AM, Prashant Kommireddi <[email protected]> wrote:
> >
> > > Hi Praveenesh,
> > >
> > > You can use 'EXPLAIN' to understand what Pig is doing under the hood
> > > (MR plan):
> > > http://pig.apache.org/docs/r0.9.1/test.html#explain
> > >
> > > What version of Pig and Hadoop are you using? I have never seen such a
> > > huge difference between Java MR and Pig. At the time you ran Pig, was
> > > the cluster idle or did you have other jobs running at the same time?
> > > Did you make sure the job was not waiting on Map or Reduce slots being
> > > made available?
> > >
> > > Thanks,
> > > Prashant
> > >
> > > On Sun, Jan 15, 2012 at 9:47 PM, praveenesh kumar <[email protected]> wrote:
> > >
> > > > Hey Guys,
> > > >
> > > > Is there any way I can see the M/R jobs that Pig runs internally for
> > > > a given Pig script?
> > > > I wanted to get unique values for a particular column.
> > > >
> > > > For that I wrote the following script:
> > > >
> > > > Data = Load 'Data.csv' using PigStorage(',');
> > > > IDs = FOREACH Data GENERATE $0;
> > > > UniqueID = Distinct IDs;
> > > > Dump UniqueID;
> > > >
> > > > Is this the right/best way to get unique values of a particular
> > > > column?
> > > >
> > > > The reason I am asking is that I ran the above script on my cluster
> > > > and it took around 30 minutes to finish.
> > > > However, when I wrote traditional Java M/R code for the same thing,
> > > > it took only 10 minutes.
> > > >
> > > > So I want to see what Pig is doing internally.
> > > > Can anyone tell me what could be the reason for such behaviour? How
> > > > can I decrease Pig execution time?
> > > >
> > > > Thanks,
> > > > Praveenesh
