Oh, and yes: the simple fix is to specify the number of reducers you
want; this works in any Pig version:

g = group a by (a, b) parallel 10;

(same goes for join, order, cogroup, and other blocking operators).
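If you'd rather not annotate every operator, there is also a script-wide knob (a minimal sketch, assuming Pig 0.8+ where `default_parallel` is available; an explicit PARALLEL on an operator still overrides it):

```pig
-- set a script-wide default reducer count (Pig 0.8+)
set default_parallel 20;

a = load '$input' using PigStorage() as (a:chararray, b:chararray, c:int, d:chararray);
g = group a by (a, b);              -- runs with 20 reducers (the default above)
h = group a by (a, b) parallel 10;  -- explicit PARALLEL overrides default_parallel
```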



On Tue, Dec 6, 2011 at 2:31 AM, Dmitriy Ryaboy <[email protected]> wrote:
> You guys are talking about different versions of Pig.
>
> Parallelism is filled in using the heuristic Prashant describes in Pig
> 0.8 or later. The behavior Ayon is seeing is consistent with Pig 0.7
> and earlier. Ayon, what's the Pig version? If it's less than 0.8,
> please try to upgrade. If you are stuck on Amazon's EC2 (they are
> still running 0.6), please contact them and ask them to upgrade.
>
> D
>
> On Tue, Dec 6, 2011 at 1:13 AM, Prashant Kommireddi <[email protected]> wrote:
>> Also, check "pig.exec.reducers.bytes.per.reducer", which should be set to
>> 1000000000, and "pig.exec.reducers.max", which should be set to 999 by
>> default.
>>
>> If those are fine too, maybe you could set "default_parallel" or use the
>> PARALLEL keyword to manually set the number of reducers.
>>
>> Thanks,
>> Prashant
>>
>>> On Tue, Dec 6, 2011 at 1:07 AM, Prashant Kommireddi <[email protected]> wrote:
>>
>>> What does the "HDFS_BYTES_READ" on JobTracker for this job say?
>>>
>>> -Prashant
>>>
>>>
>>> On Tue, Dec 6, 2011 at 12:59 AM, Ayon Sinha <[email protected]> wrote:
>>>
>>>> The total input path size is ~60GB: that is 1023 files of approx. 64MB
>>>> each. Total map output bytes was 160GB. So why was there 1 reducer? Help
>>>> me understand.
>>>>
>>>> -Ayon
>>>> See My Photos on Flickr
>>>> Also check out my Blog for answers to commonly asked questions.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>  From: Prashant Kommireddi <[email protected]>
>>>> To: Ayon Sinha <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Tuesday, December 6, 2011 12:26 AM
>>>> Subject: Re: How to see Pig MapReduce plan & classes
>>>>
>>>> Yes, when neither default parallelism nor PARALLEL is used, Pig uses
>>>> "pig.exec.reducers.bytes.per.reducer" to determine the number of
>>>> reducers. This is set to ~1GB, which means 1 reducer per ~1GB of input
>>>> data.
>>>>
>>>> You can try hadoop fs -dus <filepath> and you would see the size is less
>>>> than 1GB.
>>>>
>>>>
>>>> On Mon, Dec 5, 2011 at 11:59 PM, Ayon Sinha <[email protected]> wrote:
>>>>
>>>> > I have 1023 gz files of < 64MB each.
>>>> > I think I see the reason in the log :(
>>>> >
>>>> > 2011-12-05 23:11:20,315 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
>>>> >
>>>> > -Ayon
>>>> >
>>>> >   ------------------------------
>>>> > *From:* Prashant Kommireddi <[email protected]>
>>>> > *To:* [email protected]; Ayon Sinha <[email protected]>
>>>> > *Sent:* Monday, December 5, 2011 11:56 PM
>>>> > *Subject:* Re: How to see Pig MapReduce plan & classes
>>>> >
>>>> > What is the total size of your input dataset? Less than 1GB? Pig
>>>> > spawns 1 reducer for each gigabyte of input data.
>>>> >
>>>> > -Prashant Kommireddi
>>>> >
>>>> > On Mon, Dec 5, 2011 at 11:53 PM, Ayon Sinha <[email protected]> wrote:
>>>> >
>>>> > Hi,
>>>> > I have this script whose stage 1 has n maps, where n = # of input
>>>> > splits (# of gz files), but has 1 reducer. I need to understand why my
>>>> > script causes 1 reducer. When I think about how I'd do it in Java
>>>> > MapReduce, I don't see why there would be a single reducer in stage 1.
>>>> >
>>>> > register /home/ayon/udfs.jar;
>>>> >
>>>> > a = load '$input' using PigStorage() as (a:chararray, b:chararray,
>>>> >     c:int, d:chararray);
>>>> >
>>>> > g = group a by (a, b);
>>>> >
>>>> > g = foreach g {
>>>> >       x = order $1 by c;
>>>> >       generate group.a, group.b, x;
>>>> >       };
>>>> >
>>>> >
>>>> > u = foreach g generate myUDF($2) as triplets;
>>>> > describe u;
>>>> > dump u;
>>>> >
>>>> > Do you see any reason there should be 1 reducer at any stage? How do I
>>>> > debug this? Where are the generated classes and plan?
>>>> >
>>>> > -Ayon
>>>> >
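For reference, the 0.8+ estimate can be worked through for the ~60GB case above (a sketch using the default property values quoted earlier in the thread; the exact estimation code may differ slightly by version):

```pig
-- defaults in Pig 0.8+: 1 reducer per ~1GB of input, capped at 999
set pig.exec.reducers.bytes.per.reducer 1000000000;
set pig.exec.reducers.max 999;
-- estimated reducers = min(pig.exec.reducers.max,
--                          ceil(input_bytes / bytes_per_reducer))
-- for a ~60GB input: min(999, ceil(60000000000 / 1000000000)) = 60
```

On Pig 0.7 and earlier this heuristic does not exist, which is why the job above fell back to a single reducer.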
