I'm optimizing a somewhat large Pig job.

One of the intermediate steps is a GROUP whose output we use in later steps.

The data right now looks like:

0 {(1),(2),(3),(4)}

where the second column is a bag of tuples, each with one element.

Wouldn't it be more efficient to store this as a single tuple?

0 (1,2,3,4)

I can't figure out how to do this. Here is my test data and script:

-- test2.csv
0,1
0,2
0,3
0,4


data = LOAD 'test2.csv' USING PigStorage(',') AS (source:bytearray, target:bytearray);

grouped = GROUP data BY source;
-- $0 is the group key; $1.($1) projects the second field (target) of each tuple in the bag
thin = FOREACH grouped GENERATE $0, $1.($1);

STORE thin INTO 'thin.dmp';
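
For illustration, the tuple shape asked about above might be produced along the following lines. This is only a sketch under assumptions: it relies on the BagToTuple builtin, which as far as I know ships only with newer Pig releases (0.11 and later), and the relation name `flat`, the alias `targets`, and the output path 'flat.dmp' are made up for this example. Whether the result is actually more efficient than the bag form is not something the sketch shows, and the order of elements inside the tuple is not guaranteed, since bags are unordered.

```pig
-- Sketch only: assumes a Pig release that provides the BagToTuple builtin.
data = LOAD 'test2.csv' USING PigStorage(',') AS (source:bytearray, target:bytearray);

grouped = GROUP data BY source;

-- data.target is a bag of one-field tuples, e.g. {(1),(2),(3),(4)};
-- BagToTuple flattens that bag (first level only) into a single tuple.
-- 'flat' and 'targets' are hypothetical names chosen for this example.
flat = FOREACH grouped GENERATE group AS source, BagToTuple(data.target) AS targets;

-- 'flat.dmp' is a hypothetical output path.
STORE flat INTO 'flat.dmp';
```

Running DESCRIBE flat; before the STORE would be one way to check that the second column really is a tuple rather than a bag.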


-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*
