Hi,

I've just hit a bug that's present in all versions of Pig that I've
tested. If I generate multiple relations from different projections of
the same grouped input, then union them together and do another group
with a composite key, the local rearrange step chooses the wrong
fields to group by. Versions 0.8.1 and 0.9.1 generate incorrect
output; trunk crashes with a "duplicate uid in schema" error. I
encountered the problem in a fairly complex script, but managed to
boil it down to the following test case:

---- bug.pig

a = LOAD 'bug.in' AS (x:int, y:chararray, z:chararray);

SPLIT a INTO a1 IF x==1, a2 IF x==2, a3 IF x==3;

grouped = COGROUP a1 BY y, a2 BY y, a3 BY y;
projected = FOREACH grouped GENERATE a1.z AS z1, a2.z AS z2, a3.z AS z3;

b1 = FOREACH projected GENERATE FLATTEN(z1) AS first, FLATTEN(z2) AS second;
b2 = FOREACH projected GENERATE FLATTEN(z2) AS first, FLATTEN(z3) AS second;

c = UNION b1, b2;
-- results are as expected until this point
d = GROUP c BY (first,second);
STORE d INTO 'bug.out';

---- Input:

1       foo     line1
2       foo     line2
3       foo     line3
3       foo     line4

---- Expected output:

(line1,line2)   {(line1,line2)}
(line2,line3)   {(line2,line3)}
(line2,line4)   {(line2,line4)}

---- Actual output from 0.8.1/0.9.1
---- (note that the group is done on (first,first) instead of (first,second)):

(line1,line1)   {(line1,line2)}
(line2,line2)   {(line2,line3),(line2,line4)}

---- Stack trace from trunk:

2012-01-09 13:25:55,230 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: COGROUP,GROUP_BY,UNION
2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:287)
        at org.apache.pig.PigServer.compilePp(PigServer.java:1317)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1254)
        at org.apache.pig.PigServer.execute(PigServer.java:1246)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:131)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:192)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:589)
        at org.apache.pig.Main.main(Main.java:148)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:225)
        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:160)
        at org.apache.pig.newplan.logical.relational.LOUnion.accept(LOUnion.java:182)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
        at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
        ... 16 more

It's possible to work around the problem by performing multiple JOINs
instead of a single COGROUP and multiple FLATTENs, but the resulting
plan uses more map-reduce jobs and does a lot of redundant work.
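
A minimal sketch of what the JOIN-based workaround might look like for the
test case above. This is a reconstruction rather than the original script: it
assumes that an inner JOIN on y yields the same pairs as flattening two bags
from the COGROUP, and the relation names j12 and j23 are illustrative only.

```pig
-- workaround.pig (sketch; j12 and j23 are hypothetical names)
a = LOAD 'bug.in' AS (x:int, y:chararray, z:chararray);

SPLIT a INTO a1 IF x==1, a2 IF x==2, a3 IF x==3;

-- One inner join per pair of inputs, instead of a single COGROUP
-- followed by two FLATTENs over the same grouped relation.
j12 = JOIN a1 BY y, a2 BY y;
b1 = FOREACH j12 GENERATE a1::z AS first, a2::z AS second;

j23 = JOIN a2 BY y, a3 BY y;
b2 = FOREACH j23 GENERATE a2::z AS first, a3::z AS second;

c = UNION b1, b2;
d = GROUP c BY (first, second);
STORE d INTO 'bug.out';
```

Note that a2 is shuffled once for each join here, which is where the
redundant work comes from.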

Is this a known issue or limitation? (I searched JIRA and the list
archives, but didn't see anything that looked relevant.) If not, I'll
open an issue.

Thanks,
-- David
