One possibility is to introduce a 'mode' setting in Pig with a default value of 'strict'; other values could be 'non-strict' or potentially others. Another use case for a 'non-strict' mode is using PigStorage in Merge Join. PigStorage inherently cannot guarantee all the requirements imposed by Merge Join, but you can still use it in most cases. I don't recall all the details, but the discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
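
As a rough sketch of what this might look like from the user's side: the property name 'pig.mode' below is made up purely for illustration and does not exist in Pig today, and the fragment assumes happy_ngrams is defined as in the quoted script.

```pig
-- Hypothetical: 'pig.mode' is not an existing Pig property. It stands in
-- for the proposed setting, whose default would be 'strict'.
set pig.mode 'non-strict';

-- Under the proposal, the sanity checks on 'collected' would be relaxed and
-- Pig would trust the user that all rows for a given (id, uid) are already
-- on the same mapper.
happy_tweets = group happy_ngrams by (id, uid) using 'collected';
```

The same switch could presumably cover the PigStorage / Merge Join case, so users would opt out of the checks in one place rather than through a separate flag per operator.
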
Ashutosh

On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:
> Hi guys,
> It seems like our 'collected' option for group is pretty limited.
> Imagine I have the following (silly example) script:
>
> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long, text:chararray, ts:long);
> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>
> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as (ngram:chararray);
>
> -- get only happy ngrams, using replicated to avoid MR step
> happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';
>
> -- find only happy tweets. We know ngrams that were exploded from a single tweet
> -- must be in the same mapper still, so in theory this should work
> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>
>
> But this doesn't work, of course, because there's a whole mess of operators
> between the load and the group, including a join, and nothing makes any
> guarantees about (id, uid) being on the same mapper except for what the user
> knows about the data.
>
> What's the right approach to let the user force this through?
> a) this is an edge case optimization that's more trouble than it is worth
> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to disable sanity checks
> c) using 'collected-its-cool-dmitriy-said-its-ok'
> d) drop the checks altogether
> e) something else?
>
> D
>
