Hi,

Our data contains tuples in which one field is itself a tuple
containing a bag field, and we see the following exception when we
access that bag field:

java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:479)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:477)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:336)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:433)
        at

We can reproduce the exception with either of the following two scripts.

1.
A = LOAD 'test_input' as (a:int, T:(list:{B:(key:int, value:int)}, world:chararray));
describe A;
/*
test_input contains:
12      ({(2,13),(4,5)}, 'hello')
24      ({(8,17),(9,11),(3,4)}, 'world')

and got A's schema as:
A: {a: int,T: (list: {B: (key: int,value: int)},world: chararray)}
*/

B = FOREACH A GENERATE  T.list, T.world;
describe B;
/*
got:
B: {list: {B: (key: int,value: int)},world: chararray}
*/

dump B;

2.
......

b = foreach a generate member_id, primary_email, year_born;
c = group b by member_id;
d = foreach c generate group as member_id, b;
e = group d by member_id;
f = foreach e generate group as member_id, d;
g = foreach f generate member_id as A, flatten(d);

h = foreach g generate $0 as A, $1 AS B, $2 AS C;
describe h;
/* get the following schema:
h: {A: int,B: int,C: {member_id: int,primary_email: chararray,year_born: int}}
*/

h = foreach h generate $0 as A, Swap($1, $2) AS T;
describe h;
/* We use Swap to build a tuple out of the last two fields and get
the following schema:
h: {A: int,T: (C: {member_id: int,primary_email: chararray,year_born: int},B: int)}
*/
g = foreach h generate A, T.C;
describe g;

g = limit g 15;
dump g;

Is this a known issue?

Best,
Lin
