So, I made a dumb little Python script that parses a Pig script, sees what
stores there are, uses Pig's DESCRIBE to get the schema of each relation
being stored, and then uses that info to make a new file that has the
proper loader/schema. I felt this was useful because I found myself making
intermediate stores, and then finding it pretty difficult to write the
proper loader if there were a lot of columns (especially remembering the
types).
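A minimal sketch of how the store-finding step might look. The names STORE_RE and find_stores are made up for illustration, and it assumes every store is written as STORE alias INTO 'path' with a single-quoted path; it does not try to handle comments or parameter substitution.

```python
import re

# Matches statements such as: STORE alias INTO 'path' [USING ...];
# Assumption: the alias is a plain identifier and the path is single-quoted.
STORE_RE = re.compile(
    r"^\s*STORE\s+(\w+)\s+INTO\s+'([^']+)'",
    re.IGNORECASE | re.MULTILINE,
)


def find_stores(script_text):
    """Return a list of (alias, path) pairs, one per STORE statement found."""
    return STORE_RE.findall(script_text)
```

Each alias it returns is what you would then hand to DESCRIBE to get the schema for the matching loader.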
However, it seems that the result from DESCRIBE is not adequate to do a
load. For example, I have test.txt, which is literally just random pairs of
numbers, i.e.
1 2
1 3
1 4
2 5
2 6
3 7
3 8
4 9
5 10
6 11
7 12
8 13
8 14
8 15
and so on.
I do this:
t1 = LOAD 'test.txt' AS (n1:int, n2:int);
t2 = GROUP t1 BY n1;
t3 = GROUP t2 BY group;
DESCRIBE t3;
STORE t3 INTO 'output.txt';
The query runs without a hitch. However, there is an issue.
This is what DESCRIBE gives:
t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}
However, this schema won't let you load the file back in.
The output has the form
x{(y,{(a,b)})}
And I'm not really sure how to go about even creating a loader that would
properly load it. Suffice it to say, it seems pretty complicated to store
and then load anything that isn't a flat file. Is this by design? Is there
an easier way to go from the schema that DESCRIBE prints to the schema you'd
use to load it?
I'm curious what people do in practice. I could probably extend the script I
made to go from the DESCRIBE schema to a loading schema (if the Pig loader
can load things that have brackets and all that?), but I want to know what
the limitations are.
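A rough sketch of what that extension might look like. It rests on several assumptions: DESCRIBE prints a bag as bare braces around its fields (as in the output above), the bag{t:tuple(...)} spelling is what a LOAD ... AS clause accepts, and every other type is a simple scalar. The function names and the inner tuple alias t are made up for illustration, and I have not checked that Pig will load the result.

```python
def split_top_level(s):
    """Split s on commas that are not nested inside braces."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
        elif ch == ',' and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return [p.strip() for p in parts]


def convert_fields(body):
    """Convert 'a: int,b: {c: int}' into 'a:int, b:bag{t:tuple(c:int)}'."""
    out = []
    for field in split_top_level(body):
        name, _, ftype = field.partition(':')
        name, ftype = name.strip(), ftype.strip()
        if ftype.startswith('{') and ftype.endswith('}'):
            # Assumption: bare braces mean a bag of tuples with these fields.
            ftype = 'bag{t:tuple(' + convert_fields(ftype[1:-1]) + ')}'
        out.append(name + ':' + ftype)
    return ', '.join(out)


def describe_to_load_schema(describe_line):
    """Turn one line of DESCRIBE output into the text for an AS clause."""
    _alias, _, schema = describe_line.partition(':')
    schema = schema.strip()
    if not (schema.startswith('{') and schema.endswith('}')):
        raise ValueError('expected a relation schema wrapped in braces')
    return '(' + convert_fields(schema[1:-1]) + ')'
```

Fed the DESCRIBE line for t3 above, this produces (group:int, t2:bag{t:tuple(group:int, t1:bag{t:tuple(n1:int, n2:int)})}). Note that group is also a Pig keyword, so that field may need renaming before the schema can be used in a LOAD.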
As always, I apologize if there is an easy answer to this. Thanks.