That does seem like a bug; thanks for investigating it. Could you file a JIRA?

On Thu, Dec 15, 2011 at 5:42 PM, Adam Portley <[email protected]> wrote:
> I'm seeing some strange behavior but I don't know if it's a bug.  I have a
> Pig script that looks something like:
>
> REGISTER myjar.jar
> raw = LOAD 'mydata' USING myLoader();
> partial = FOREACH raw GENERATE Column0;
> streamed = stream partial through `/bin/echo` as (mySchema);
> STORE streamed INTO 'myFile';
>
> When I run this script (with Pig 0.9.1) I see:
>
> Pig features used in the script: STREAMING
> 2011-12-15 23:36:07,485 [main] INFO
>  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for raw: $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12
> 2011-12-15 23:36:07,575 [main] INFO  org.apache.hadoop.hdfs.DFSClient -
> Created HDFS_DELEGATION_TOKEN token...
> ...
>
> and pruning works as expected.  But if I remove the schema specifier from
> the streaming operator:
> streamed = stream partial through `/bin/echo`;
>
> then I see:
> 2011-12-15 23:43:07,706 [main] INFO
>  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: STREAMING
> 2011-12-15 23:43:07,765 [main] INFO  org.apache.hadoop.hdfs.DFSClient -
> Created HDFS_DELEGATION_TOKEN token...
>
> and Pig tries to load the entire data set.  Is this a bug?  It seems strange
> that the schema on the output of streaming would affect upstream pruning.
>
> Digging through the code I see that ColumnPruneHelper::check() builds a
> sub-plan ending in foreach but then it gives ColumnDependencyVisitor the
> full plan.  Maybe it should only use the sub-plan?
>
> Thanks,
> --Adam
>

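The hypothesis in the quoted message (a dependency check that walks the full plan versus only the sub-plan ending at the foreach) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pig's actual code: the `Op` class, `scriptPlan`, `subPlanThrough`, and `columnsToKeep` are hypothetical names, and the rule "give up on pruning as soon as an operator with an unknown schema is seen" is an assumed behavior chosen to match the symptom described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

public class PruneSketch {
    /** Hypothetical operator: a name plus the columns of `raw` it references (null = schema unknown). */
    static final class Op {
        final String name;
        final Set<Integer> referencedColumns;

        Op(String name, Set<Integer> referencedColumns) {
            this.name = name;
            this.referencedColumns = referencedColumns;
        }
    }

    /** Toy version of the script in the thread: LOAD -> FOREACH -> STREAM -> STORE. */
    static List<Op> scriptPlan(boolean streamHasSchema) {
        Set<Integer> none = Collections.emptySet();
        return Arrays.asList(
            new Op("LOAD", none),
            new Op("FOREACH", Collections.singleton(0)),     // GENERATE Column0
            new Op("STREAM", streamHasSchema ? none : null), // no "as (...)" means unknown schema
            new Op("STORE", none));
    }

    /** Returns the prefix of the plan up to and including the named operator. */
    static List<Op> subPlanThrough(List<Op> plan, String lastOpName) {
        List<Op> sub = new ArrayList<>();
        for (Op op : plan) {
            sub.add(op);
            if (op.name.equals(lastOpName)) {
                break;
            }
        }
        return sub;
    }

    /**
     * Assumed rule: if any visited operator has an unknown schema, abandon pruning
     * (empty result); otherwise keep the union of all referenced columns.
     */
    static Optional<Set<Integer>> columnsToKeep(List<Op> plan) {
        Set<Integer> keep = new TreeSet<>();
        for (Op op : plan) {
            if (op.referencedColumns == null) {
                return Optional.empty();
            }
            keep.addAll(op.referencedColumns);
        }
        return Optional.of(keep);
    }

    public static void main(String[] args) {
        List<Op> plan = scriptPlan(false); // stream without a schema
        System.out.println("full plan: " + columnsToKeep(plan));
        System.out.println("sub-plan:  " + columnsToKeep(subPlanThrough(plan, "FOREACH")));
    }
}
```

In this model, visiting the full plan reaches the schema-less stream operator and abandons pruning, while visiting only the sub-plan that ends at the foreach still finds that column 0 is the only one needed. Whether Pig's ColumnDependencyVisitor behaves this way would need to be confirmed against the source.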