I have 1023 gz files of < 64MB each. I think I see the reason in the log :(
2011-12-05 23:11:20,315 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.

________________________________
From: Prashant Kommireddi <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Monday, December 5, 2011 11:56 PM
Subject: Re: How to see Pig MapReduce plan & classes

What is the total size of your input dataset? Less than 1GB? Pig spawns 1 reducer for each gigabyte of input data.

-Prashant Kommireddi

On Mon, Dec 5, 2011 at 11:53 PM, Ayon Sinha <[email protected]> wrote:

> Hi,
> I have this script whose stage 1 has n maps, where n = # of input splits (# of gz files), but only 1 reducer. I need to understand why my script causes 1 reducer. When I think about how I'd do it in Java MapReduce, I don't see why there would be a single reducer in stage 1.
>
> register /home/ayon/udfs.jar;
>
> a = load '$input' using PigStorage() as (a:chararray, b:chararray, c:int, d:chararray);
>
> g = group a by (a, b);
>
> g = foreach g {
>     x = order $1 by c;
>     generate group.a, group.b, x;
> };
>
> u = foreach g generate myUDF($2) as triplets;
> describe u;
> dump u;
>
> Do you see any reason there should be 1 reducer at any stage? How do I debug this? Where are the generated classes and plan?
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
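[Editor's note] For anyone hitting the same log message: a minimal sketch of the two usual ways to raise the reducer count, assuming Pig 0.8 or later. The value 32 is an arbitrary placeholder, not something from this thread, and the fragment reuses the aliases from Ayon's script only for illustration.

-- Option 1: a script-wide default applied to every reduce-side job.
SET default_parallel 32;

-- Option 2: a per-operator PARALLEL clause on the operator that forces the
-- reduce phase (the GROUP here); it overrides default_parallel for that job.
g = group a by (a, b) PARALLEL 32;

Prashant's one-reducer-per-gigabyte rule is the automatic estimate Pig falls back on when neither of these is set; if memory serves it is controlled by pig.exec.reducers.bytes.per.reducer (1 GB by default) and capped by pig.exec.reducers.max, though the exact behavior depends on the Pig version. With either setting in place, the "Setting number of reducers to 1" message should no longer apply to that job.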

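[Editor's note] On the other question in the thread (where to see the plan and generated classes): Pig's EXPLAIN operator prints the plans it compiles for an alias without running the job. A sketch below; the optional flags are from memory and worth checking against the docs for the version in use.

-- Prints the logical, physical, and MapReduce plans for alias u.
explain u;

-- Optionally write the plans to files (DOT output can be rendered as graphs):
-- explain -out /tmp/pig_plan -dot u;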