Hi Daniel,
I am seeing this behaviour with 0.8.1.
Consider an input file named a containing the following:
1|2|3
3||4
I start Pig in local mode and then run the following script:
a = load 'a' using PigStorage('|');
b = group a by $0;
c = foreach b generate 'Test' as name, flatten(group), SUM(a.$0) as s0,
SUM(a.$1) as s1, SUM(a.$2) as s2;
dump c;
The above script does not use the combiner.
However, the following script does:
a = load 'a' using PigStorage('|');
b = group a by $0;
c = foreach b generate flatten(group), SUM(a.$0) as s0, SUM(a.$1) as s1,
SUM(a.$2) as s2;
dump c;
This script uses the combiner.
I narrowed the difference down to whether the foreach statement projects a
constant. Is this expected behavior? I thought the decision to use a
combiner depended only on the UDFs implementing the Algebraic interface. Why
does projecting a constant stop the combiner from being used?
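In case it helps to reproduce this, here is a minimal sketch of how the plan can be inspected without running the job. It reuses the first script above and replaces dump with explain. My understanding, which is an assumption on my part and not something I have checked against the Pig source, is that the MapReduce plan printed by explain shows a "Combine Plan" section only when the combiner is going to be used.

```pig
a = load 'a' using PigStorage('|');
b = group a by $0;
c = foreach b generate 'Test' as name, flatten(group), SUM(a.$0) as s0,
    SUM(a.$1) as s1, SUM(a.$2) as s2;
-- Print the logical, physical and MapReduce plans for c instead of executing it.
-- Assumption: a "Combine Plan" section appears in the MapReduce plan only
-- when the combiner is used.
explain c;
```

Running the same explain on the second script (without the 'Test' constant) should make the difference between the two plans visible side by side.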
Thanks,
Shubham.
On Thu, Jun 16, 2011 at 2:38 PM, Daniel Dai <[email protected]> wrote:
> Do you mean "d = group c by (var1, var2); "? If so, I can see the combiner
> being used. Which version of Pig are you using?
>
> Daniel
>
>
> On 06/16/2011 11:13 AM, Shubham Chopra wrote:
>
>> Hi,
>>
>> My pig query is roughly the following:
>>
>> register some_lib.jar
>> a = load 'somefile' using CustomUDF();
>> b = foreach a generate CustomProjectionUDF();
>> c = foreach b generate var1, var2, var3;
>> d = group b by (var1, var2);
>> e = foreach d generate flatten(group), SUM(c.var1), SUM(c.var2),
>> SUM(c.var3);
>> store e into 'file';
>>
>> I was expecting to see the combiner being used, but the optimizer did not
>> use a combiner. The following is the output I see (version 0.8.1)
>> INFO executionengine.HExecutionEngine: pig.usenewlogicalplan is set to
>> true.
>> New logical plan will be used.
>> INFO executionengine.HExecutionEngine: (Name: agg:
>> Store(hdfs://machine:9000/SomeFile:PigStorage('|')) - scope-4353
>> Operator
>> Key: scope-4353)
>> INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100
>> optimistic? false
>> INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before
>> optimization: 1
>> INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after
>> optimization: 1
>> INFO mapReduceLayer.AccumulatorOptimizer: Reducer is to run in
>> accumulative
>> mode.
>> INFO pigstats.ScriptState: Pig script settings are added to the job
>> INFO mapReduceLayer.JobControlCompiler: BytesPerReducer=1000000000
>> maxReducers=999 totalInputFileSize=611579950
>> INFO mapReduceLayer.JobControlCompiler: Neither PARALLEL nor default
>> parallelism is set for this job. Setting number of reducers to 1
>> INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for
>> submission.
>>
>> How can I enforce the use of combiner here?
>>
>> Thanks,
>> Shubham.
>>
>
>