On 12/30/10 4:35 PM, "Dexin Wang" <[email protected]> wrote:
> Seems after FLATTEN, the rows with null values get dropped.
>
What you are seeing is the expected/documented behavior of flatten -
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
"Note that the flatten of empty bag will result in that row being discarded"
(Note that it's 'empty bag', not 'null'.)
>
> You see if I do FLATTEN, all the rows with null values are all missing (in
> D). If I don't do FLATTEN, as in E, I have all the rows but not flattened,
> obviously. What I want as the end result is:
>
> (1,a,b,x)
> (2,c,d,y)
> (3,e,f,{})
> (6,,,z)
> (8,,,w)
>
> How can I get that? Thanks.
>
To keep those rows, replace each empty bag with null before flattening it:
D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1,f2))),
FLATTEN((IsEmpty(B) ? null : B.f3));
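For context, here is a rough end-to-end sketch of how that statement might fit into a script. The input paths 'A.txt' and 'B.txt' are placeholders, and I am assuming C was built with COGROUP on id (suggested by the use of `group` and of the bags A and B above), so adjust it to your actual script.

```pig
-- Hypothetical inputs: the file names are placeholders, not from the original script.
A = LOAD 'A.txt' AS (id, f1, f2);
B = LOAD 'B.txt' AS (id, f3);

-- Assumption: C is a COGROUP on id, so each row is (group, bag of A rows, bag of B rows).
C = COGROUP A BY id, B BY id;

-- Swap each empty bag for null before FLATTEN, so the row is not discarded.
D = FOREACH C GENERATE group,
        FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
        FLATTEN((IsEmpty(B) ? null : B.f3));

DUMP D;
```

The bincond is needed only because FLATTEN discards a row when the bag is empty; ids present on both sides go through unchanged.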
> I realize I could do FULL JOIN, but the problem is that after join, I
> wouldn't know which id is null, I would have to do many if then in the
> following generate command and I hope I can avoid that. E.g.,
>
> C = JOIN A BY id FULL, B BY id;
> DUMP C
> (1,a,b,1,x)
> (2,c,d,2,y)
> (3,e,f,,)
> (,,,6,z)
> (,,,8,w)
>
> DESCRIBE C;
> C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> bytearray,B::f3: bytearray}
>
> Sometimes A::id is null, sometimes B::id null, I always only want the
> non-null id in my output.
>
You can get this with the conditional expression (? :), which the Pig
documentation calls bincond. Field names are case-sensitive, so they must
match the schema shown by DESCRIBE (f1, not F1):
E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;
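Putting the join version together, a minimal sketch might look like the following. The input paths are placeholders, and the AS aliases on the output fields are my addition for readability rather than something the approach requires.

```pig
-- Hypothetical inputs: the file names are placeholders.
A = LOAD 'A.txt' AS (id, f1, f2);
B = LOAD 'B.txt' AS (id, f3);

-- Full outer join: a row missing on one side gets nulls for that side's fields.
C = JOIN A BY id FULL, B BY id;

-- Take whichever id is non-null, then project the remaining fields.
E = FOREACH C GENERATE
        (A::id IS NULL ? B::id : A::id) AS id,
        A::f1 AS f1,
        A::f2 AS f2,
        B::f3 AS f3;

DUMP E;
```

Only the id needs the bincond; the other fields can be projected directly, since a null there is exactly what you want to keep.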
-Thejas