I am using Apache Pig version 0.11.0-SNAPSHOT (r1225753), built from trunk, and Hadoop 0.20.205.
Nothing else was running on the cluster at that time, and there was no waiting for map-reduce slots.
The only difference I saw was that my Java M/R job ran only 40 reducers, whereas my Pig job ran 457 reducers.
I guess the slowdown may be because of so many reducers running. Can I control the number of reducers?
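
For reference, Pig Latin has two knobs that look relevant here: a script-wide `SET default_parallel` and a per-operator `PARALLEL` clause. Below is a minimal sketch of the same script using both. The value 40 is only an assumption copied from the Java M/R job, not a tuned number, and I have not verified that it closes the gap:

```pig
-- Script-wide default for reduce-side operators.
-- 40 is an assumed value mirroring the Java M/R job; tune it for the cluster.
SET default_parallel 40;

Data = LOAD 'Data.csv' USING PigStorage(',');
IDs = FOREACH Data GENERATE $0;

-- Per-operator override: PARALLEL takes precedence over default_parallel.
UniqueID = DISTINCT IDs PARALLEL 40;

DUMP UniqueID;
```

Either setting alone should be enough; both are shown only to illustrate the syntax.
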
Thanks,
Praveenesh

On Mon, Jan 16, 2012 at 11:42 AM, Prashant Kommireddi <[email protected]> wrote:

> Hi Praveenesh,
>
> You can use 'EXPLAIN' to understand what Pig is doing under the hood (MR
> plan):
> http://pig.apache.org/docs/r0.9.1/test.html#explain
>
> What version of Pig and Hadoop are you using? I have never seen such a huge
> difference between Java MR and Pig. At the time you ran Pig, was the
> cluster idle or did you have other jobs running at the same time? Did you
> make sure the job was not waiting on Map or Reduce slots being made
> available?
>
> Thanks,
> Prashant
>
> On Sun, Jan 15, 2012 at 9:47 PM, praveenesh kumar <[email protected]> wrote:
>
> > Hey Guys,
> >
> > Is there any way I can see the M/R jobs that Pig runs
> > internally for a given Pig script?
> > I wanted to get the unique values of a particular column.
> >
> > For that I wrote the following script:
> >
> > Data = Load 'Data.csv' using PigStorage(',');
> > IDs = FOREACH Data GENERATE $0;
> > UniqueID = Distinct IDs;
> > Dump UniqueID;
> >
> > Is this the right/best way to get the unique values of a particular column?
> >
> > The reason I am asking is that I ran the above script on my cluster and it
> > took around 30 minutes to finish.
> > However, when I wrote traditional Java M/R code for the same thing, it
> > took only 10 minutes.
> >
> > So I want to see what Pig is doing internally.
> > Can anyone tell me what could be the reason for such behaviour? How can I
> > decrease Pig execution time?
> >
> > Thanks,
> > Praveenesh
