That does seem like a bug; thanks for investigating it. Could you file a JIRA?

On Thu, Dec 15, 2011 at 5:42 PM, Adam Portley <[email protected]> wrote:
> I'm seeing some strange behavior but I don't know if it's a bug. I have a
> pig script that looks something like:
>
> REGISTER myjar.jar
> raw = LOAD 'mydata' USING myLoader();
> partial = FOREACH raw GENERATE Column0;
> streamed = stream partial through `/bin/echo` as (mySchema);
> STORE streamed INTO 'myFile';
>
> When I run this script (with pig 0.9.1) I see:
>
> Pig features used in the script: STREAMING
> 2011-12-15 23:36:07,485 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for raw: $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12
> 2011-12-15 23:36:07,575 [main] INFO org.apache.hadoop.hdfs.DFSClient -
> Created HDFS_DELEGATION_TOKEN token...
> ...
>
> and pruning works as expected. But if I remove the schema specifier from
> the streaming operator:
>
> streamed = stream partial through `/bin/echo`;
>
> then I see:
>
> 2011-12-15 23:43:07,706 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: STREAMING
> 2011-12-15 23:43:07,765 [main] INFO org.apache.hadoop.hdfs.DFSClient -
> Created HDFS_DELEGATION_TOKEN token...
>
> and pig tries to load the entire data set. Is this a bug? It seems strange
> that the output of streaming would affect upstream pruning.
>
> Digging through the code I see that ColumnPruneHelper::check() builds a
> sub-plan ending in foreach but then it gives ColumnDependencyVisitor the
> full plan. Maybe it should only use the sub-plan?
>
> Thanks,
> --Adam
>
