[ https://issues.apache.org/jira/browse/PIG-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042572#comment-14042572 ]
Abhishek Agarwal commented on PIG-3931:
---------------------------------------

+1 for having DUMP take a limit as an additional argument. It is certainly more convenient.

> DUMP should limit how much data it emits
> ----------------------------------------
>
>                 Key: PIG-3931
>                 URL: https://issues.apache.org/jira/browse/PIG-3931
>             Project: Pig
>          Issue Type: Improvement
>          Components: grunt, impl
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: dump, grunt, inline, limit, nested, operator
>
> The DUMP command is fairly dangerous: leave a stray DUMP uncommented from
> debugging your script on reduced data, and it will spew a terabyte of data
> into your console with no apology.
> 1. By (configurable) default, DUMP should not emit more than 1MB of data
> 2. The DUMP statement should accept a limit on rows
> h3. Safety valve limit on output size
> Pig should gain a pig.max_dump_bytes configuration variable imposing an
> approximate upper bound on how much data DUMP will emit. Since a GROUP BY
> statement can generate an extremely large bag, this safety-valve limit should
> be in bytes, not rows. I propose a default of 1,000,000 bytes -- good for
> about 1000 records of 1 kB each. Pig should emit a warning to the console if
> the max_dump_bytes limit is hit.
> This is a breaking change, but users shouldn't be using DUMP for anything
> other than experimentation. Pig should favor the experimentation use case,
> and let the foolhardy push the max_dump_bytes limit back up on their own.
> h3. DUMP can elegantly limit the number of rows
> Right now I have to write the following annoyingly wordy statement:
> {code}
> dumpable = LIMIT teams 10; DUMP dumpable;
> {code}
> One approach would be to allow DUMP to accept an inline (nested) operator.
> Assignment statements can have inline operators, but DUMP can't:
> {code}
> -- these work, which is so awesome:
> some = FOREACH (LIMIT teams 10) GENERATE team_id, park_id;
> some = GROUP (LIMIT teams 10) BY park_id;
> STORE (LIMIT teams 10) INTO '/tmp/some_teams';
> -- these don't work, but maybe they should:
> DUMP (LIMIT teams 10);
> DUMP (GROUP teams BY team_id);
> {code}
> Alternatively, DUMP could accept an argument:
> {code}
> dumpable = DUMP teams LIMIT 10;
> dumpable = DUMP teams LIMIT ALL;
> {code}
> The generated plan should be equivalent to that from `some = LIMIT teams 10;
> DUMP some` so that optimizations on LIMIT kick in.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
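The proposed safety valve is simple enough to sketch. The following is a conceptual illustration in Java (Pig's implementation language), not actual Pig internals: the class name CappedDumpWriter and its methods are hypothetical, and it assumes the proposed pig.max_dump_bytes semantics -- stop emitting tuples once an approximate byte budget is exhausted, and warn once.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed pig.max_dump_bytes safety valve;
// not part of the real Pig codebase.
public class CappedDumpWriter {
    private final PrintStream out;
    private final long maxBytes;      // the proposal suggests a 1,000,000-byte default
    private long bytesWritten = 0;
    private boolean warned = false;

    public CappedDumpWriter(PrintStream out, long maxBytes) {
        this.out = out;
        this.maxBytes = maxBytes;
    }

    /** Returns true if the rendered tuple was emitted, false once the byte cap is hit. */
    public boolean writeTuple(String renderedTuple) {
        // Approximate the cost as UTF-8 bytes plus the trailing newline.
        long size = renderedTuple.getBytes(StandardCharsets.UTF_8).length + 1;
        if (bytesWritten + size > maxBytes) {
            if (!warned) {
                out.println("WARN: dump truncated after " + bytesWritten
                        + " bytes (pig.max_dump_bytes=" + maxBytes + ")");
                warned = true;
            }
            return false;
        }
        out.println(renderedTuple);
        bytesWritten += size;
        return true;
    }
}
```

Because the bound is checked per tuple, a single enormous bag from a GROUP BY still cannot blow past the budget by more than one tuple, which is the point of capping bytes rather than rows.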