It seems that after FLATTEN, the rows with null values get dropped.
I have two test files:
% cat test1.txt
1 a b
2 c d
3 e f
% cat test2.txt
1 x
2 y
6 z
8 w
I'm trying to cogroup the two on the first column:
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);
C = COGROUP A BY id, B BY id;
DUMP C;
(1,{(1,a,b)},{(1,x)})
(2,{(2,c,d)},{(2,y)})
(3,{(3,e,f)},{})
(6,{},{(6,z)})
(8,{},{(8,w)})
D = FOREACH C GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.f3);
DUMP D;
(1,a,b,x)
(2,c,d,y)
E = FOREACH C GENERATE group, A.(f1, f2), B.f3;
DUMP E;
(1,{(a,b)},{(x)})
(2,{(c,d)},{(y)})
(3,{(e,f)},{})
(6,{},{(z)})
(8,{},{(w)})
As you can see, if I use FLATTEN, all the rows with null values are missing
(in D). If I don't use FLATTEN, as in E, I have all the rows, but obviously
they are not flattened. What I want as the end result is:
(1,a,b,x)
(2,c,d,y)
(3,e,f,{})
(6,,,z)
(8,,,w)
How can I get that? Thanks.
Dexin
P.S.
I realize I could do a FULL JOIN, but the problem is that after the join I
wouldn't know which id is null, so I would have to write many if-then
conditions in the following GENERATE statement, and I hope I can avoid that. E.g.,
C = JOIN A BY id FULL, B BY id;
DUMP C;
(1,a,b,1,x)
(2,c,d,2,y)
(3,e,f,,)
(,,,6,z)
(,,,8,w)
DESCRIBE C;
C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id: bytearray,B::f3: bytearray}
Sometimes A::id is null and sometimes B::id is null; I always want only the
non-null id in my output.
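
For illustration, the kind of if-then meant here could be sketched roughly as
follows, using Pig's conditional (bincond) operator on the FULL JOIN result.
The alias F and the output field name id are made up for this sketch, and it
has not been run against the test files above, so treat it as an assumption
rather than a verified answer.

```pig
-- C is the FULL JOIN result shown above.
-- F and the output field name 'id' are hypothetical names for this sketch.
-- The bincond keeps A::id when it is non-null and falls back to B::id otherwise.
F = FOREACH C GENERATE
        (A::id IS NOT NULL ? A::id : B::id) AS id,
        A::f1, A::f2, B::f3;
DUMP F;
```

With only one key column this is a single conditional, but it would have to
be repeated for every key column if the join were on several fields, which is
the part I would like to avoid.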