Oh, and yes: the simple fix is to specify the number of reducers you want explicitly. This works in any Pig version:

g = group a by (a, b) parallel 10;

(The same goes for join, order, cogroup, and the other blocking operators.)
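For instance, a minimal sketch combining this with the default_parallel option Prashant mentions below (default_parallel needs Pig 0.8 or later; the aliases match the script at the bottom of the thread):

set default_parallel 20;            -- script-wide default for every blocking operator (0.8+)
a = load '$input' using PigStorage() as (a:chararray, b:chararray, c:int, d:chararray);
g = group a by (a, b) parallel 10;  -- parallel on the operator overrides default_parallel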
On Tue, Dec 6, 2011 at 2:31 AM, Dmitriy Ryaboy <[email protected]> wrote:
> You guys are talking about different versions of Pig.
>
> Parallelism is filled in using the heuristic Prashant describes in Pig
> 0.8 or later. The behavior Ayon is seeing is consistent with Pig 0.7
> and earlier. Ayon, what's the Pig version? If it's less than 0.8,
> please try to upgrade. If you are stuck on Amazon's EC2 (they are
> still running 0.6), please contact them and ask them to upgrade.
>
> D
>
> On Tue, Dec 6, 2011 at 1:13 AM, Prashant Kommireddi <[email protected]> wrote:
>> Also, check "pig.exec.reducers.bytes.per.reducer", which should be set to
>> 1000000000, and "pig.exec.reducers.max", which should be set to 999 by
>> default.
>>
>> If those are fine too, maybe you could set default_parallel or use the
>> PARALLEL keyword to manually set the # of reducers.
>>
>> Thanks,
>> Prashant
>>
>> On Tue, Dec 6, 2011 at 1:07 AM, Prashant Kommireddi <[email protected]> wrote:
>>
>>> What does the "HDFS_BYTES_READ" counter on the JobTracker say for this job?
>>>
>>> -Prashant
>>>
>>> On Tue, Dec 6, 2011 at 12:59 AM, Ayon Sinha <[email protected]> wrote:
>>>
>>>> The total input path size is ~60GB. That is 1023 files of approx. 64MB
>>>> each. Total map output bytes was 160GB. So why was there 1 reducer? Help me
>>>> understand.
>>>>
>>>> -Ayon
>>>>
>>>> ________________________________
>>>> From: Prashant Kommireddi <[email protected]>
>>>> To: Ayon Sinha <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Tuesday, December 6, 2011 12:26 AM
>>>> Subject: Re: How to see Pig MapReduce plan & classes
>>>>
>>>> Yes, when neither default parallelism nor PARALLEL is used, Pig uses
>>>> "pig.exec.reducers.bytes.per.reducer" to determine the number of reducers.
>>>> This is set to ~1GB, which means 1 reducer per ~1GB of input data.
>>>>
>>>> You can try hadoop fs -dus <filepath> and you would see the size is less
>>>> than 1GB.
>>>>
>>>> On Mon, Dec 5, 2011 at 11:59 PM, Ayon Sinha <[email protected]> wrote:
>>>>
>>>> > I have 1023 gz files of < 64MB each.
>>>> > I think I see the reason in the log :(
>>>> >
>>>> > 2011-12-05 23:11:20,315 [main] INFO
>>>> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>>> > - Neither PARALLEL nor default parallelism is set for this job. Setting
>>>> > number of reducers to 1
>>>> >
>>>> > -Ayon
>>>> >
>>>> > ------------------------------
>>>> > From: Prashant Kommireddi <[email protected]>
>>>> > To: [email protected]; Ayon Sinha <[email protected]>
>>>> > Sent: Monday, December 5, 2011 11:56 PM
>>>> > Subject: Re: How to see Pig MapReduce plan & classes
>>>> >
>>>> > What is the total size of your input dataset? Less than 1GB? Pig spawns 1
>>>> > reducer for each gigabyte of input data.
>>>> >
>>>> > -Prashant Kommireddi
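(Back-of-the-envelope, to illustrate that heuristic with the defaults quoted above: ~60GB of input should come out to roughly

min(ceil(60GB / 1GB), 999) = 60 reducers

so seeing 1 reducer instead is consistent with a pre-0.8 Pig.)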
>>>> > On Mon, Dec 5, 2011 at 11:53 PM, Ayon Sinha <[email protected]> wrote:
>>>> >
>>>> > Hi,
>>>> > I have this script whose stage 1 has n maps, where n = # of input splits
>>>> > (# of gz files), but has 1 reducer. I need to understand why my script
>>>> > causes 1 reducer. When I think about how I'd do it in Java MapReduce, I
>>>> > don't see why there would be a single reducer in stage 1.
>>>> >
>>>> > register /home/ayon/udfs.jar;
>>>> >
>>>> > a = load '$input' using PigStorage() as (a:chararray, b:chararray,
>>>> >     c:int, d:chararray);
>>>> >
>>>> > g = group a by (a, b);
>>>> >
>>>> > g = foreach g {
>>>> >     x = order $1 by c;
>>>> >     generate group.a, group.b, x;
>>>> > };
>>>> >
>>>> > u = foreach g generate myUDF($2) as triplets;
>>>> > describe u;
>>>> > dump u;
>>>> >
>>>> > Do you see any reason there should be 1 reducer at any stage? How do I
>>>> > debug this? Where are the generated classes and plan?
>>>> >
>>>> > -Ayon
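As for Ayon's remaining question about seeing the plan: Pig's EXPLAIN prints it, from the Grunt shell or inside a script. A minimal sketch, using the alias from the script above:

explain u;                      -- prints the logical, physical, and MapReduce plans for u
explain -out /tmp/plans -dot u; -- recent versions can also dump the plans as DOT files

As far as I know, Pig doesn't generate per-script Java classes; it serializes the plan into the job jar it submits to Hadoop.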

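And if on 0.8+ you'd rather tune the heuristic than override it per operator, the two properties Prashant mentions can be set in conf/pig.properties. The values below are just an illustration (halving bytes-per-reducer roughly doubles the reducer count the heuristic picks):

pig.exec.reducers.bytes.per.reducer=500000000
pig.exec.reducers.max=999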