The JIRA https://issues.apache.org/jira/browse/PIG-514 has brought up
an interesting issue of how we handle empty bags in foreach
statements. The current Pig semantic for foreach is that it always
produces a cross product of all of the fields in its projection
list. So:
    B = foreach A generate $0, $1;
technically produces a cross product of $0 and $1. Since both $0 and
$1 are (generally) single valued this produces one row. In cases
where they are multi-valued (generate flatten($0), $1) then the cross
product produces multiple rows. In cases where any of the elements
is an empty bag, the cross product produces no rows. That is,
emptiness is equivalent to a 0 in multiplication: it swallows
everything.
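To make the multiplication analogy concrete, here is a small Python model of the semantic described above. It is only a sketch, not Pig code: it assumes each projected field is represented as a list of values (a scalar is a one-element list, a flattened bag is its contents), and `foreach_row` is a hypothetical name used purely for illustration.

```python
from itertools import product

def foreach_row(*fields):
    """Model one input row of a foreach as the cross product of its fields.

    Assumption: each field is a list of values. A scalar is a one-element
    list; a flattened bag is the list of its contents.
    """
    return list(product(*fields))

# Two single-valued fields: exactly one output row.
single = foreach_row(["a"], [1])

# A flattened bag of three values crossed with a scalar: three rows.
multi = foreach_row(["x", "y", "z"], [1])

# An empty bag anywhere in the projection: zero rows.
empty = foreach_row([], [1])
```

The number of output rows is the product of the field sizes, so a single field of size 0 makes the whole result empty.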
Because of this, Pig is currently implemented such that as soon as it
sees an empty bag in an output it stops, because there's no point in
continuing. So, scripts like:
    A = load 'myfile';
    B = group A by $0;
    C = foreach B {
        -- no value is both > 5 and < 5, so E is always an empty bag
        D = filter A by $1 > 5;
        E = filter D by $1 < 5;
        generate COUNT(E.$0), group;
    };
will generate no output at all. It would be reasonable to expect that
the above would produce one row for each distinct value in the first
field of 'myfile', along with a 0 (for the count).
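The gap between what happens and what one might expect can be sketched with a rough Python model of the script above. This is not how Pig is implemented: the `run` function, its `short_circuit` flag, and the sample rows are all made up for illustration, and the model only mimics the "stop at the first empty bag" behavior described here.

```python
from collections import defaultdict

def run(rows, short_circuit):
    """Model: group by field 0, apply the two filters, emit (COUNT(E), group).

    short_circuit=True mimics the current behavior (stop on an empty bag);
    short_circuit=False mimics the behavior a user might expect.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    out = []
    for key, bag in groups.items():
        d = [t for t in bag if t[1] > 5]   # D = filter A by $1 > 5
        e = [t for t in d if t[1] < 5]     # E = filter D by $1 < 5 (always empty)
        if short_circuit and not e:
            continue                       # empty bag: nothing is emitted for this group
        out.append((len(e), key))          # generate COUNT(E.$0), group
    return out

rows = [("a", 3), ("a", 7), ("b", 9)]      # hypothetical contents of 'myfile'
current = run(rows, short_circuit=True)    # no rows at all
expected = run(rows, short_circuit=False)  # one row per group, each with count 0
```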
A couple of questions about this:
1) Should we keep this empty bag as a blackhole semantic? It strikes
me as reasonable that, instead of being a blackhole, an empty bag
produces a NULL value. This would make outer joins somewhat easier
to do. I'm not sure what other side effects it would have.
2) If we do keep the blackhole semantic, should UDFs get a chance to
evaluate an empty bag? The current implementation certainly seems to
violate the principle of least astonishment. However, if we extend this to
UDFs we need to think carefully about what else it needs to be
extended to. In particular, the semantic for streaming is that if we
have no data, we will not invoke the external binary. It seems we should be
consistent throughout. Any empty bag should either mean that we stop
processing there and return nothing, or that we allow user provided
code a chance to run, even without input.
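As a rough illustration of the first option, the Python sketch below contrasts the blackhole semantic with a NULL-producing one. It is a model under stated assumptions, not a proposed implementation: fields are lists of values, NULL is represented by `None`, and the names `foreach_blackhole` and `foreach_null` are hypothetical.

```python
from itertools import product

def foreach_blackhole(*fields):
    """Current semantic (as modeled here): an empty bag yields zero rows."""
    return list(product(*fields))

def foreach_null(*fields):
    """Alternative semantic: an empty bag contributes a single NULL (None)."""
    padded = [field if field else [None] for field in fields]
    return list(product(*padded))

# A key with two matches on the left side and none on the right,
# which is the shape an outer join has to deal with.
key, left, right = ["k"], [1, 2], []

dropped = foreach_blackhole(key, left, right)  # the key disappears entirely
kept = foreach_null(key, left, right)          # left rows survive, padded with None
```

Under the NULL variant the unmatched left-hand rows come out padded with `None`, which is the row shape an outer join wants.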
Thoughts? Insights?
Alan.