Thanks. Both worked fine. I think I'll make a MyFlatten that doesn't drop rows with an empty bag. Say you want to COGROUP 3 or more bags: you would have to do many COGROUPs or JOINs, then do IsEmpty or bincond every time. Instead, with MyFlatten, I would do:

X = COGROUP A BY id, B BY id, C BY id, D BY id;
Y = FOREACH X GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.(f3,f4,f5)), FLATTEN(C.f6), FLATTEN(D.f7);

The code would be a lot more concise and cleaner.

On Thu, Dec 30, 2010 at 6:46 PM, Thejas M Nair <[email protected]> wrote:

> On 12/30/10 4:35 PM, "Dexin Wang" <[email protected]> wrote:
>
> > Seems after FLATTEN, the rows with null values get dropped.
>
> What you are seeing is the expected/documented behavior of flatten -
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
> "Note that the flatten of empty bag will result in that row being
> discarded"
> (Note that it's 'empty bag', not 'null'.)
>
> > You see if I do FLATTEN, all the rows with null values are all missing
> > (in D). If I don't do FLATTEN, as in E, I have all the rows but not
> > flattened, obviously. What I want as the end result is:
> >
> > (1,a,b,x)
> > (2,c,d,y)
> > (3,e,f,{})
> > (6,,,z)
> > (8,,,w)
> >
> > How can I get that? Thanks.
>
> D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1,f2))),
>     FLATTEN((IsEmpty(B) ? null : B.f3));
>
> > I realize I could do FULL JOIN, but the problem is that after join, I
> > wouldn't know which id is null, I would have to do many if then in the
> > following generate command and I hope I can avoid that. E.g.,
> >
> > C = JOIN A BY id FULL, B BY id;
> > DUMP C
> > (1,a,b,1,x)
> > (2,c,d,2,y)
> > (3,e,f,,)
> > (,,,6,z)
> > (,,,8,w)
> >
> > DESCRIBE C;
> > C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id: bytearray,B::f3: bytearray}
> >
> > Sometimes A::id is null, sometimes B::id null, I always only want the
> > non-null id in my output.
>
> You can get this by using the conditional expression (called bincond in pig
> documents) (? : ).
>
> E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;
>
> -Thejas
