I've confirmed this behavior in 0.8.1, and that it's fixed in trunk (I didn't check 0.9).
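
Until you're on a version with the fix, one possible workaround (just a sketch based on the script below, not something I've run) is to keep the constant out of the aggregating foreach so the combiner can still fire, and add it in a second foreach afterwards:

a = load 'a' using PigStorage('|');
b = group a by $0;
-- aggregate only: no constant projection here, so the combiner should still be used
c = foreach b generate flatten(group), SUM(a.$0) as s0, SUM(a.$1) as s1, SUM(a.$2) as s2;
-- tack the constant onto the already-aggregated rows
d = foreach c generate 'Test' as name, group, s0, s1, s2;
dump d;

Since d only adds a literal to rows that have already been reduced, it shouldn't affect whether the combiner is chosen for c.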
On Thu, Jun 16, 2011 at 12:00 PM, Shubham Chopra <[email protected]> wrote:
> Hi Daniel,
>
> I am seeing this behaviour with 0.8.1.
>
> Consider an input file named 'a' containing the following:
> 1|2|3
> 3||4
>
> I start Pig in local mode and then use the following script:
> a = load 'a' using PigStorage('|');
> b = group a by $0;
> c = foreach b generate 'Test' as name, flatten(group), SUM(a.$0) as s0,
> SUM(a.$1) as s1, SUM(a.$2) as s2;
> dump c;
>
> The above script does not use the combiner.
>
> However, the following script does:
> a = load 'a' using PigStorage('|');
> b = group a by $0;
> c = foreach b generate flatten(group), SUM(a.$0) as s0, SUM(a.$1) as s1,
> SUM(a.$2) as s2;
> dump c;
>
> This script uses the combiner.
>
> I pinpointed the difference to using or not using a constant in the foreach
> statement. Is this expected behavior? I was thinking the decision to use
> a combiner depends on UDFs implementing the Algebraic interface. Why is the
> constant projection stopping the combiner from being used?
>
> Thanks,
> Shubham.
>
> On Thu, Jun 16, 2011 at 2:38 PM, Daniel Dai <[email protected]> wrote:
>
>> Do you mean "d = group c by (var1, var2);"? If so, I can see the combiner
>> being used. Which version of Pig are you using?
>>
>> Daniel
>>
>>
>> On 06/16/2011 11:13 AM, Shubham Chopra wrote:
>>
>>> Hi,
>>>
>>> My Pig query is roughly the following:
>>>
>>> register some_lib.jar
>>> a = load 'somefile' using CustomUDF();
>>> b = foreach a generate CustomProjectionUDF();
>>> c = foreach b generate var1, var2, var3;
>>> d = group b by (var1, var2);
>>> e = foreach d generate flatten(group), SUM(c.var1), SUM(c.var2),
>>> SUM(c.var3);
>>> store e into 'file';
>>>
>>> I was expecting to see the combiner being used, but the optimizer did not
>>> use a combiner. The following is the output I see (version 0.8.1):
>>> INFO executionengine.HExecutionEngine: pig.usenewlogicalplan is set to
>>> true. New logical plan will be used.
>>> INFO executionengine.HExecutionEngine: (Name: agg:
>>> Store(hdfs://machine:9000/SomeFile:PigStorage('|')) - scope-4353
>>> Operator Key: scope-4353)
>>> INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100
>>> optimistic? false
>>> INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before
>>> optimization: 1
>>> INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after
>>> optimization: 1
>>> INFO mapReduceLayer.AccumulatorOptimizer: Reducer is to run in
>>> accumulative mode.
>>> INFO pigstats.ScriptState: Pig script settings are added to the job
>>> INFO mapReduceLayer.JobControlCompiler: BytesPerReducer=1000000000
>>> maxReducers=999 totalInputFileSize=611579950
>>> INFO mapReduceLayer.JobControlCompiler: Neither PARALLEL nor default
>>> parallelism is set for this job. Setting number of reducers to 1
>>> INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for
>>> submission.
>>>
>>> How can I enforce the use of the combiner here?
>>>
>>> Thanks,
>>> Shubham.
>>>
>>
>>
