Hi Pig users,

Is there an easy/efficient way to sample an inner bag? For example, with input 
in a relation like

(id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
(id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
(id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})

I’d like to sample 1/3 of the elements of each bag and get something like
this (ignoring the non-determinism):

(id1,att1,{(x,0.999749968742)})
(id1,att2,{(b,0.04)})
(id2,att1,{(b,0.05)})

I have a circumlocution using flatten + group that seems to work, but it looks
ugly to me:

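-- One row per (id, att), each carrying a bag of (word, value) pairs.
tfidf1 = load '$tfidf' as (id: chararray,
                          att: chararray,
                          pairs: {pair: (word: chararray, value: double)});

-- Un-nest the bag: one row per (id, att, word, value).
flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
-- Keep each flattened row with probability 0.33.
sample_flat_tfidf = sample flat_tfidf 0.33;
-- Regroup the surviving rows by their original key.
tfidf2 = group sample_flat_tfidf by (id, att);

-- Rebuild the inner bag, projecting only the word and value fields.
tfidf = foreach tfidf2 {
   pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
   generate group.id, group.att, pairs;
};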

Can someone suggest a better way to do this?  Many thanks!

William F Dowling
Senior Technologist

Thomson Reuters


