I'm trying to play around with Amazon EMR, and I currently have self-hosted
Cassandra as the source of data. I was going to try: Cassandra -> S3
-> EMR. I've traced my problems to PigStorage. At this point I can
reproduce my problem "locally" without involving S3 or Amazon.
In my local test environment I have this script:
data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});
STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
I can verify that the HDFS file looks vaguely correct (\t-separated fields,
newline-separated records, my data in the right spots).
Then if I do:
data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
columns:bag {column:tuple (name, value)});
keys = FOREACH data GENERATE key;
DUMP keys;
I can see that the data is wrong. In the dump I sometimes see keys, sometimes
columns, and sometimes a jumble of keys and columns lumped together.
As far as I can tell, PigStorage is unable to parse the data it just
persisted. I've tried Pig 0.8, 0.9, and 0.10 with the same results.
In terms of my data:
key = URI (ASCII)
columns = binary UUID column names -> JSON values (ASCII)
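One thing I'm starting to suspect (just a hypothesis at this point): PigStorage delimits fields with \t and records with \n, so a binary UUID column name whose raw bytes happen to contain 0x09 or 0x0a would be indistinguishable from the delimiters when the file is read back. A quick Python sketch of the idea:

```python
import uuid

# PigStorage's default delimiters.
FIELD_SEP = b"\t"    # 0x09
RECORD_SEP = b"\n"   # 0x0a

# A UUID constructed so its 16 raw bytes contain both delimiter bytes.
# (Random UUIDs can hit this too: each byte has a 2/256 chance of
# being 0x09 or 0x0a.)
u = uuid.UUID(bytes=b"\x09\x0a" + b"\x00" * 14)
raw = u.bytes

print(FIELD_SEP in raw)   # an embedded tab would split the field
print(RECORD_SEP in raw)  # an embedded newline would split the record
```

If that's the cause, it would explain why some rows round-trip fine while others come back scrambled: only keys/columns containing those byte values get mangled.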
Any ideas? Next I guess I'll see what kind of debugging Pig offers in the
STORE/LOAD paths.
Thanks!
will