You guys are talking about different versions of Pig.

In Pig 0.8 and later, parallelism is filled in automatically using the
heuristic Prashant describes. The behavior Ayon is seeing is consistent
with Pig 0.7 and earlier. Ayon, what's your Pig version? If it's less
than 0.8, please try to upgrade. If you are stuck on Amazon's EC2 (they are
still running 0.6), please contact them and ask them to upgrade.
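
(As a quick check, assuming the pig launcher script is on your PATH,
running

  pig -version

should print the version your client was built from.)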

D

On Tue, Dec 6, 2011 at 1:13 AM, Prashant Kommireddi <[email protected]> wrote:
> Also, check "pig.exec.reducers.bytes.per.reducer", which should be set to
> 1000000000 (1GB) by default, and "pig.exec.reducers.max", which should be
> set to 999 by default.
>
> If those are fine too, maybe you could set "default_parallel" or use the
> PARALLEL keyword to manually set the number of reducers.
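>
> For instance, something along these lines (the value 10 is just a
> placeholder, and whether default_parallel is available depends on the Pig
> version):
>
>   -- script-wide default for all reduce-side operators
>   set default_parallel 10;
>
>   -- or per operator, on the statement that triggers the reduce
>   g = group a by (a, b) PARALLEL 10;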
>
> Thanks,
> Prashant
>
> On Tue, Dec 6, 2011 at 1:07 AM, Prashant Kommireddi
> <[email protected]> wrote:
>
>> What does the "HDFS_BYTES_READ" on JobTracker for this job say?
>>
>> -Prashant
>>
>>
>> On Tue, Dec 6, 2011 at 12:59 AM, Ayon Sinha <[email protected]> wrote:
>>
>>> The total input path size is ~60GB. That is 1023 files of approx. 64MB
>>> each. Total Map output bytes was 160GB. So why was there 1 reducer? Help me
>>> understand.
>>>
>>> -Ayon
>>> See My Photos on Flickr
>>> Also check out my Blog for answers to commonly asked questions.
>>>
>>>
>>>
>>> ________________________________
>>>  From: Prashant Kommireddi <[email protected]>
>>> To: Ayon Sinha <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Sent: Tuesday, December 6, 2011 12:26 AM
>>> Subject: Re: How to see Pig MapReduce plan & classes
>>>
>>> Yes, when neither default parallelism nor PARALLEL is used, Pig uses
>>> "pig.exec.reducers.bytes.per.reducer" to determine the number of reducers.
>>> This is set to ~1GB by default, which means 1 reducer per ~1GB of input data.
>>>
>>> You can try hadoop fs -dus <filepath> and you would see the size is less
>>> than 1GB.
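>>>
>>> For example (the path below is just a placeholder for your input
>>> location):
>>>
>>>   hadoop fs -dus /path/to/input
>>>
>>> Roughly, the number of reducers should come out to ceil(total input bytes
>>> / pig.exec.reducers.bytes.per.reducer), capped at pig.exec.reducers.max,
>>> so anything under 1GB ends up with a single reducer.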
>>>
>>>
>>> On Mon, Dec 5, 2011 at 11:59 PM, Ayon Sinha <[email protected]> wrote:
>>>
>>> > I have 1023 gz files of < 64MB each.
>>> > I think I see the reason in the log :(
>>> >
>>> > 2011-12-05 23:11:20,315 [main] INFO
>>> >  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>> > - Neither PARALLEL nor default parallelism is set for this job. Setting
>>> > number of reducers to 1
>>> >
>>> > -Ayon
>>> > See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
>>> > Also check out my Blog for answers to commonly asked questions.
>>> > <http://dailyadvisor.blogspot.com>
>>> >
>>> >   ------------------------------
>>> > From: Prashant Kommireddi <[email protected]>
>>> > To: [email protected]; Ayon Sinha <[email protected]>
>>> > Sent: Monday, December 5, 2011 11:56 PM
>>> > Subject: Re: How to see Pig MapReduce plan & classes
>>> >
>>> > What is the total size of your input dataset? Less than 1GB? Pig spawns
>>> > 1 reducer for each gigabyte of input data.
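>>> >
>>> > (So, roughly, a 2.5GB input would get 3 reducers under that rule, while
>>> > anything at or under 1GB gets just 1.)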
>>> >
>>> > -Prashant Kommireddi
>>> >
>>> > On Mon, Dec 5, 2011 at 11:53 PM, Ayon Sinha <[email protected]> wrote:
>>> >
>>> > Hi,
>>> > I have this script whose stage 1 has n maps, where n = # of input splits
>>> > (# of gz files), but has 1 reducer. I need to understand why my script
>>> > causes 1 reducer. When I think about how I'd do it in Java MapReduce, I
>>> > don't see why there would be a single reducer in stage 1.
>>> >
>>> > register /home/ayon/udfs.jar;
>>> >
>>> > a = load '$input' using PigStorage()
>>> >     as (a:chararray, b:chararray, c:int, d:chararray);
>>> >
>>> > g = group a by (a, b);
>>> >
>>> > g = foreach g {
>>> >       x = order $1 by c;
>>> >       generate group.a, group.b, x;
>>> >       };
>>> >
>>> >
>>> > u = foreach g generate myUDF($2) as triplets;
>>> > describe u;
>>> > dump u;
>>> >
>>> > Do you see any reason there should be 1 reducer at any stage? How do I
>>> > debug this? Where are the generated classes and plan?
>>> >
>>> > -Ayon
>>> > See My Photos on Flickr
>>> > Also check out my Blog for answers to commonly asked questions.
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
