What does the "HDFS_BYTES_READ" counter on the JobTracker say for this job?

-Prashant
On Tue, Dec 6, 2011 at 12:59 AM, Ayon Sinha <[email protected]> wrote:

> The total input path size is ~60GB. That is 1023 files of appx. 64MB each.
> Total Map output bytes was 160GB. So why was there 1 reducer? Help me
> understand.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
> ________________________________
> From: Prashant Kommireddi <[email protected]>
> To: Ayon Sinha <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Sent: Tuesday, December 6, 2011 12:26 AM
> Subject: Re: How to see Pig MapReduce plan & classes
>
> Yes, when neither default parallelism nor PARALLEL is used Pig uses
> "pig.exec.reducers.bytes.per.reducer" to determine number of reducers.
> This is set to ~1GB -> which means 1 reducer per ~1GB of input data.
>
> You can try hadoop fs -dus <filepath> and you would see the size is less
> than 1GB.
>
> On Mon, Dec 5, 2011 at 11:59 PM, Ayon Sinha <[email protected]> wrote:
>
> > I have 1023 gz files of < 64MB each.
> > I think I see the reason in the log :(
> >
> > 2011-12-05 23:11:20,315 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> > - Neither PARALLEL nor default parallelism is set for this job. Setting
> > number of reducers to 1
> >
> > -Ayon
> > See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
> > Also check out my Blog for answers to commonly asked questions.
> > <http://dailyadvisor.blogspot.com>
> >
> > ------------------------------
> > *From:* Prashant Kommireddi <[email protected]>
> > *To:* [email protected]; Ayon Sinha <[email protected]>
> > *Sent:* Monday, December 5, 2011 11:56 PM
> > *Subject:* Re: How to see Pig MapReduce plan & classes
> >
> > What is the total size of your input dataset? Less than 1GB? Pig spawns 1
> > reducer for each gigabyte of input data.
> >
> > -Prashant Kommireddi
> >
> > On Mon, Dec 5, 2011 at 11:53 PM, Ayon Sinha <[email protected]> wrote:
> >
> > Hi,
> > I have this script whose stage 1 has n maps where n = # of input splits
> > (# gz files) but has 1 reducer. I need to understand why my script causes 1
> > reducer. When I think about how I'd do it in Java MapReduce, I dont see why
> > there would be a single reducer in stage 1.
> >
> > register /home/ayon/udfs.jar;
> >
> > a = load '$input' using PigStorage() as (a:chararray, b:chararray, c:int,
> >     d:chararray);
> >
> > g = group a by (a, b);
> >
> > g = foreach g {
> >     x = order $1 by c;
> >     generate group.a, group.b, x;
> > };
> >
> > u = foreach g generate myUDF($2) as triplets;
> > describe u;
> > dump u;
> >
> > Do you see any reason there should be 1 reducer at any stage? How do I
> > debug this? Where are the generated classes and plan?
> >
> > -Ayon
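
For reference, the "1 reducer per ~1GB of input" rule mentioned in the thread can be sketched as simple arithmetic. This is only an illustration, not Pig's actual code: the function name estimate_reducers is hypothetical, the 1,000,000,000-byte value stands in for the "~1GB" quoted above, and the cap of 999 (pig.exec.reducers.max) is an assumption that does not appear in the thread. Whether a given Pig version applies this estimate at all should be checked against its documentation; the log line quoted above shows the job falling back to 1 reducer.

```python
# Assumed values; verify both against the documentation for your Pig version.
BYTES_PER_REDUCER = 1_000_000_000  # pig.exec.reducers.bytes.per.reducer (~1GB, per the thread)
MAX_REDUCERS = 999                 # pig.exec.reducers.max (assumption, not stated in the thread)


def estimate_reducers(input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Estimate reducers as one per bytes_per_reducer of input.

    The result is never below 1 and never above max_reducers.
    """
    if input_bytes < 0:
        raise ValueError("input_bytes must be non-negative")
    # Integer ceiling division, so large sizes are not rounded through floats.
    wanted = -(-input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    # An input under 1GB, the case Prashant first suspected.
    print(estimate_reducers(500 * 1000 * 1000))
    # Roughly the ~60GB input size reported later in the thread.
    print(estimate_reducers(60 * 1000 ** 3))
```

Under these assumptions a ~60GB input would call for about 60 reducers rather than 1, which is why the thread goes on to ask what HDFS_BYTES_READ reports. The thread also names the two explicit overrides: a PARALLEL clause on the statement that triggers the reduce phase, or a default parallelism setting for the script.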
