Sounds like a bug. I guess no one has ever relied on specific split info before.
Please open a Jira.

Daniel

On Fri, Jan 6, 2012 at 10:21 PM, Alex Rovner <[email protected]> wrote:

> Additionally, it looks like PigRecordReader is not incrementing the index in
> the PigSplit when dealing with CombinedInputFormat, so the index will be
> incorrect in either case.
>
> On Fri, Jan 6, 2012 at 4:50 PM, Alex Rovner <[email protected]> wrote:
>
> > Ran into this today. Using trunk (0.11)
> >
> > If you are using a custom loader and are trying to get input split
> > information in prepareToRead(), getWrappedSplit() returns the first
> > split instead of the current one.
> >
> > Checking the code confirms the suspicion:
> >
> > PigSplit.java:
> >
> >     public InputSplit getWrappedSplit() {
> >         return wrappedSplits[0];
> >     }
> >
> > Should be:
> >     public InputSplit getWrappedSplit() {
> >         return wrappedSplits[splitIndex];
> >     }
> >
> >
> > The side effect is that if you try to retrieve the current split
> > while Pig is using CombinedInputFormat, it always returns the
> > first file in the list instead of the one it is currently reading. I
> > have also confirmed this by adding a log statement in prepareToRead():
> >
> >     @Override
> >     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader, PigSplit split)
> >             throws IOException {
> >         String path =
> >                 ((FileSplit) split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
> >         partitions = getPartitions(table, path);
> >         log.info("Preparing to read: " + path);
> >         this.reader = reader;
> >     }
> >
> > 2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
> > 2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
> > 2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
> > 2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> > 2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
> > 2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> >
> >
> > Notice how Pig correctly reports the current split, but my "info"
> > statement always reports the first input split instead of the current one.
> >
> > Bug? Jira? Patch?
> >
> > Thanks
> > Alex R
> >
>
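
For anyone who wants to see the reported behavior without Pig or Hadoop on the classpath, here is a minimal sketch. It is only a simplified model, not Pig's code: `CombinedSplit`, `getFirstWrappedSplit()`, `getCurrentWrappedSplit()` and `advance()` are hypothetical names invented for illustration, and plain strings stand in for `InputSplit` objects. It models both points raised in the thread: an accessor that always returns element 0, and an index that only helps if the reader actually advances it.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a combined split. This is NOT org.apache.pig's PigSplit;
// all names below are hypothetical and exist only for illustration.
public class CombinedSplitSketch {

    static final class CombinedSplit {
        // File names stand in for the wrapped InputSplit objects.
        private final List<String> wrappedSplits;
        // Position of the wrapped split currently being read.
        private int splitIndex = 0;

        CombinedSplit(List<String> wrappedSplits) {
            this.wrappedSplits = wrappedSplits;
        }

        // Models the behavior reported in the thread: always the first wrapped split.
        String getFirstWrappedSplit() {
            return wrappedSplits.get(0);
        }

        // Models the proposed change: the wrapped split at the current index.
        String getCurrentWrappedSplit() {
            return wrappedSplits.get(splitIndex);
        }

        // Models the second point in the thread: the reader has to move the index
        // forward when it switches to the next wrapped split, otherwise the
        // index-based accessor is just as stale as the first one.
        boolean advance() {
            if (splitIndex + 1 >= wrappedSplits.size()) {
                return false; // no more wrapped splits
            }
            splitIndex++;
            return true;
        }
    }

    public static void main(String[] args) {
        // File names taken from the log output quoted in this thread.
        CombinedSplit split = new CombinedSplit(Arrays.asList("150-r-00005", "150-r-00006"));
        do {
            System.out.println("first=" + split.getFirstWrappedSplit()
                    + " current=" + split.getCurrentWrappedSplit());
        } while (split.advance());
    }
}
```

In this model the second line printed shows `first=150-r-00005` next to `current=150-r-00006`, which mirrors the mismatch between the PigRecordReader and HiveLoader log lines quoted above.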
