Hi guys,
It seems like our 'collected' option for group is pretty limited.
Imagine I have the following (silly example) script:

tweets = load 'tweets' using TweetLoader()
    as (id:long, uid:long, text:chararray, ts:long);
happy_words = load 'happy_words' using HappyLoader() as (word:chararray);

ngrams = foreach tweets generate id, uid, ts,
    FLATTEN(NGRAM(text)) as (ngram:chararray);

-- get only happy ngrams, using replicated to avoid MR step
happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';

-- find only happy tweets. We know ngrams that were exploded from a single tweet
-- must still be in the same mapper, so in theory this should work
happy_tweets = group happy_ngrams by (id, uid) using 'collected';


But this doesn't work, of course, because there's a whole mess of operators
between the load and the group, including a join, and nothing makes any
guarantees about (id, uid) being on the same mapper except for what the user
knows about the data.
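
For comparison, the fallback that does get accepted is the plain group, which
costs us the reduce phase that 'collected' was supposed to avoid. A rough
sketch, reusing the aliases from the script above (happy_counts and happy_count
are names I made up for illustration):

-- plain group: no map-side assumption, so this adds a shuffle and a reduce
happy_tweets = group happy_ngrams by (id, uid);

-- one row per happy tweet, with the number of happy ngrams it contained
happy_counts = foreach happy_tweets generate FLATTEN(group) as (id, uid),
    COUNT(happy_ngrams) as happy_count;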

What's the right approach to let the user force this through?
a) this is an edge case optimization that's more trouble than it is worth
b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
disable sanity checks
c) using 'collected-its-cool-dmitriy-said-its-ok'
d) drop the checks altogether
e) something else?

D
