I'm trying to play around with Amazon EMR, and I currently have a self-hosted
Cassandra cluster as the source of data.  I was going to try: Cassandra -> S3
-> EMR.  I've traced my problems to PigStorage.  At this point I can
recreate the problem "locally" without involving S3 or Amazon at all.

In my local test environment I have this script:

data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});

STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();


I can verify that the HDFS file looks vaguely correct (tab-separated fields,
newline-separated records, my data in the right spots).


Then if I do:

data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
columns:bag {column:tuple (name, value)});

keys = FOREACH data GENERATE key;

DUMP keys;


I can see that the data is wrong.  In the dump sometimes I see keys, sometimes
I see columns, and sometimes I see a mishmash of keys and columns lumped
together.


As far as I can tell PigStorage is unable to parse the data it just
persisted.  I've tried pig 0.8, 0.9 and 0.10 with the same results.


In terms of my data:

key = URI (ASCII)

columns = binary UUID -> JSON (ASCII)
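If those binary UUIDs are effectively random bytes, hitting a delimiter byte isn't a rare event. A quick back-of-the-envelope check (plain Python; assumes the 16 bytes are roughly uniform, which ignores the fixed version/variant bits of a real UUID but is close enough for an estimate):

```python
import os

# Estimate how often a random 16-byte value contains a tab (0x09)
# or newline (0x0a) byte, i.e. one of PigStorage's delimiters.
def has_delimiter(raw: bytes) -> bool:
    return b"\t" in raw or b"\n" in raw

trials = 100_000
hits = sum(has_delimiter(os.urandom(16)) for _ in range(trials))
print(f"{hits / trials:.1%}")  # roughly 12% of values, i.e. most rows
                               # with more than a handful of columns
```

Analytically it's 1 - (254/256)^16, about 11.8% per UUID, so almost every row with ten-plus columns would carry at least one corrupting byte.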


Any ideas?  Next I guess I'll see what kind of debugging hooks Pig offers
in the STORE/LOAD paths.


Thanks!


will
