I'm seeing some strange behavior but I don't know if it's a bug. I have
a pig script that looks something like:
REGISTER myjar.jar
raw = LOAD 'mydata' USING myLoader();
partial = FOREACH raw GENERATE Column0;
streamed = STREAM partial THROUGH `/bin/echo` AS (mySchema);
STORE streamed INTO 'myFile';
When I run this script (with Pig 0.9.1) I see:
Pig features used in the script: STREAMING
2011-12-15 23:36:07,485 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
for raw: $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12
2011-12-15 23:36:07,575 [main] INFO org.apache.hadoop.hdfs.DFSClient -
Created HDFS_DELEGATION_TOKEN token...
...
and pruning works as expected. But if I remove the schema specifier
from the STREAM operator:
streamed = STREAM partial THROUGH `/bin/echo`;
then I see:
2011-12-15 23:43:07,706 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: STREAMING
2011-12-15 23:43:07,765 [main] INFO org.apache.hadoop.hdfs.DFSClient -
Created HDFS_DELEGATION_TOKEN token...
and Pig loads the entire data set with no pruning. Is this a bug? It
seems strange that the declared output schema of a STREAM operator
would affect pruning of columns upstream of it.
Digging through the code, I see that ColumnPruneHelper::check() builds a
sub-plan ending at the FOREACH, but then hands ColumnDependencyVisitor
the full plan. Should it be using only the sub-plan instead?
Thanks,
--Adam