Hi,
While learning Pig I noticed the following, which I thought I'd share to get
some insight. It seems to me that Pig cannot (automatically) distinguish
between an empty tuple and a tuple with a single null value. Example:
data = LOAD 'testusers.dat' as (user, empty:tuple(), partial:tuple(p));
DUMP data;
(1,(),())
(2,(),(4))
(3,(),(5))
more = FOREACH data GENERATE user, empty, partial, (), (null), (null,null);
DESCRIBE more;
more: {user: bytearray,empty: (),partial: (p: bytearray),(),(bytearray),(bytearray,bytearray)}
ILLUSTRATE more;
--------------------------------------------------------------------------------------------------------------------------------------
| more | user:bytearray | empty:tuple() | partial:tuple(p:bytearray) | :tuple() | :tuple(:bytearray) | :tuple(:bytearray,:bytearray) |
--------------------------------------------------------------------------------------------------------------------------------------
|      | 1              | ()            | ()                         | ()       | ()                 | (, )                          |
--------------------------------------------------------------------------------------------------------------------------------------
DUMP more;
(1,(),(),(),(),(,))
(2,(),(4),(),(),(,))
(3,(),(5),(),(),(,))
STORE more INTO 'out';
1 () () () () (,)
2 () (4) () () (,)
3 () (5) () () (,)
(Pig 0.9.1 in local mode).
It seems to me that an empty tuple and a tuple with a single null value are
both represented as (). It also seems that explicitly declaring the schema
lets a Pig programmer interpret () either way, which is good.
Interestingly, if 'partial' is instead declared as 'partial:tuple()', Pig
can still handle the data; I guess it automatically adds fields to the tuple
as it encounters them in each row.
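To make the ambiguity concrete, here is a toy Python model of the text
rendering seen in the DUMP output above. It is only a sketch, not Pig's actual
code: it assumes fields are joined with commas and a null prints as an empty
string. The names render and parse are hypothetical.

```python
# Toy model of a comma-joined tuple rendering. This is NOT Pig's code; it only
# assumes that a null field prints as an empty string, as the DUMP output suggests.

def render(fields):
    """Render a tuple as text; None stands in for a Pig null."""
    return "(" + ",".join("" if f is None else str(f) for f in fields) + ")"


def parse(text, declared_fields):
    """Hypothetical reader: the declared schema decides how "()" is interpreted."""
    if declared_fields == 0:
        return ()
    inner = text[1:-1]
    parts = inner.split(",") if inner else [""]
    # Pad missing slots with empty strings, which become nulls below.
    parts += [""] * (declared_fields - len(parts))
    return tuple(None if p == "" else p for p in parts[:declared_fields])


print(render(()))            # empty tuple
print(render((None,)))       # tuple with a single null
print(render((None, None)))  # tuple with two nulls
print(parse("()", 0), parse("()", 1))
```

Under these assumptions the first two calls print the same text, so only a
declared schema (the second argument to parse here) can tell the two apart,
whereas two nulls stay distinguishable because of the comma.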
Thoughts?