[ https://issues.apache.org/jira/browse/PIG-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042572#comment-14042572 ]
Abhishek Agarwal commented on PIG-3931:
---------------------------------------

+1 for having DUMP take a limit as an additional argument. It is certainly more convenient.

> DUMP should limit how much data it emits
> ----------------------------------------
>
>                 Key: PIG-3931
>                 URL: https://issues.apache.org/jira/browse/PIG-3931
>             Project: Pig
>          Issue Type: Improvement
>          Components: grunt, impl
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: dump, grunt, inline, limit, nested, operator
>
> The DUMP command is fairly dangerous: leave a stray DUMP uncommented from
> debugging your script on reduced data, and it will spew a terabyte of data
> into your console with no apology.
> 1. By (configurable) default, DUMP should not emit more than 1MB of data
> 2. The DUMP statement should accept a limit on rows
> h3. Safety valve limit on output size
> Pig should gain a pig.max_dump_bytes configuration variable imposing an
> approximate upper bound on how much data DUMP will emit. Since a GROUP BY
> statement can generate an extremely large bag, this safety-valve limit should
> be in bytes, not rows. I propose a default of 1,000,000 bytes -- good for
> about 1000 records of 1 kB each. Pig should emit a warning to the console if
> the max_dump_bytes limit is hit.
> This is a breaking change, but users shouldn't be using DUMP for anything
> other than experimentation. Pig should favor the experimentation use case,
> and let the foolhardy push the max_dump_bytes limit back up on their own.
> h3. DUMP can elegantly limit the number of rows
> Right now I have to write the following annoyingly wordy statement:
> {code}
> dumpable = LIMIT teams 10; DUMP dumpable;
> {code}
> One approach would be to allow DUMP to accept an inline (nested) operator.
> Assignment statements can have inline operators, but DUMP can't:
> {code}
> -- these work, which is so awesome:
> some = FOREACH (LIMIT teams 10) GENERATE team_id, park_id;
> some = GROUP (LIMIT teams 10) BY park_id;
> STORE (LIMIT teams 10) INTO '/tmp/some_teams';
> -- these don't work, but maybe they should:
> DUMP (LIMIT teams 10);
> DUMP (GROUP teams BY team_id);
> {code}
> Alternatively, DUMP could accept an argument:
> {code}
> dumpable = DUMP teams LIMIT 10;
> dumpable = DUMP teams LIMIT ALL;
> {code}
> The generated plan should be equivalent to that from `some = LIMIT teams 10;
> DUMP some` so that optimizations on LIMIT kick in.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
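The proposed safety valve is simple enough to sketch. The following is a conceptual illustration in Java (Pig's implementation language), not actual Pig internals: the class name CappedDumpWriter and its methods are hypothetical, and it assumes the proposed pig.max_dump_bytes semantics -- stop emitting tuples once an approximate byte budget is exhausted, and warn once.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed pig.max_dump_bytes safety valve;
// not part of the real Pig codebase.
public class CappedDumpWriter {
    private final PrintStream out;
    private final long maxBytes;      // the proposal suggests a 1,000,000-byte default
    private long bytesWritten = 0;
    private boolean warned = false;

    public CappedDumpWriter(PrintStream out, long maxBytes) {
        this.out = out;
        this.maxBytes = maxBytes;
    }

    /** Returns true if the rendered tuple was emitted, false once the byte cap is hit. */
    public boolean writeTuple(String renderedTuple) {
        // Approximate the cost as UTF-8 bytes plus the trailing newline.
        long size = renderedTuple.getBytes(StandardCharsets.UTF_8).length + 1;
        if (bytesWritten + size > maxBytes) {
            if (!warned) {
                out.println("WARN: dump truncated after " + bytesWritten
                        + " bytes (pig.max_dump_bytes=" + maxBytes + ")");
                warned = true;
            }
            return false;
        }
        out.println(renderedTuple);
        bytesWritten += size;
        return true;
    }
}
```

Because the bound is checked per tuple, a single enormous bag from a GROUP BY still cannot blow past the budget by more than one tuple, which is the point of capping bytes rather than rows.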