FLATTEN is kind of quirky. If you FLATTEN(null), it returns null, but if you FLATTEN a bag that is empty (i.e., size = 0), it throws away the row. I would have your UDF return an empty bag instead of null for bad lines and let FLATTEN drop those rows for you.
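As a rough sketch of what the script could look like after that change (this assumes MyUDF has been modified to return an empty bag rather than null for bad lines; I have not run this against your data):

```pig
-- Assumption: MyUDF now returns an empty bag (size 0) for a bad line,
-- instead of null.
raw = LOAD 'raw_input' AS (line:chararray);

-- FLATTEN of an empty bag produces no output row for that input line,
-- so no separate FILTER ... IS NOT NULL step should be needed.
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));
DUMP parsed;
```

With that in place, the empty () row should no longer show up in the output, and the extra FILTER step that triggers the POProject warning can be dropped.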

2012/3/1 Dexin Wang <[email protected]>
> Hi,
>
> I have a UDF that parses a line and then return a bag, and sometimes the
> line is bad so I'm returning null in the UDF. In my pig script, I'd like to
> filter those nulls like this:
>
> raw = LOAD 'raw_input' AS (line:chararray);
> parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two fields in the tuple: id and name
> DUMP parsed;
>
> (id1,name1)
> (id2,name2)
> ()
> (id3,name3)
>
> parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
> DUMP parsed_no_nulls;
>
> (id1,name1)
> (id2,name2)
> (id3,name3)
>
> This works, but I'm getting this warning:
>
> WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger
> - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
> Attempt to access field which was not found in the input
>
> When I try to use IsEmpty to filter, I get this error "Cannot test a NULL
> for emptiness".
>
> What's the correct way to filter out these null bags returned from my UDF?
>
> Thanks.
> Dexin
