I'm optimizing a fairly large Pig job.
One of the intermediate steps is a GROUP whose output we reuse in later steps.
The data right now looks like:
0 {(1),(2),(3),(4)}
where the second column is a bag of tuples, each with one element.
Wouldn't it be more efficient to store this as a single tuple, like this?
0 (1,2,3,4)
I can't figure out how to do this. Here is my test data and the script I have so far:
-- test2.csv
0,1
0,2
0,3
0,4
data = LOAD 'test2.csv' USING PigStorage(',') AS (source:bytearray, target:bytearray);
grouped = GROUP data BY source;
thin = FOREACH grouped GENERATE $0, $1.($1);
STORE thin INTO 'thin.dmp';