Additionally, it looks like PigRecordReader does not increment the index in
the PigSplit when dealing with CombinedInputFormat, so the index will be
incorrect in either case.
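
For illustration, here is a minimal sketch of the two pieces involved: returning
the currently-read wrapped split instead of the first one, and advancing the
index as the reader moves through a combined split. This is not the actual Pig
source; the class name and everything other than getWrappedSplit() is
hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.InputSplit;

    // Sketch only -- mirrors the structure described above, not the real PigSplit.
    public class CombinedSplitSketch {

        private final InputSplit[] wrappedSplits; // underlying splits of a combined split
        private int splitIndex = 0;               // which wrapped split is currently being read

        public CombinedSplitSketch(InputSplit[] wrappedSplits) {
            this.wrappedSplits = wrappedSplits;
        }

        // Reported behavior: always returns the first wrapped split.
        public InputSplit getFirstWrappedSplit() {
            return wrappedSplits[0];
        }

        // Suggested behavior: return the split that is actually being read.
        public InputSplit getWrappedSplit() {
            return wrappedSplits[splitIndex];
        }

        // The record reader would need to call something like this each time it
        // finishes one wrapped split and opens the next -- the missing increment
        // noted above.
        public void advanceToNextWrappedSplit() throws IOException {
            if (splitIndex + 1 >= wrappedSplits.length) {
                throw new IOException("no more wrapped splits");
            }
            splitIndex++;
        }
    }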

On Fri, Jan 6, 2012 at 4:50 PM, Alex Rovner <[email protected]> wrote:

> Ran into this today. Using trunk (0.11)
>
> If you are using a custom loader and are trying to get input split
> information in prepareToRead(), getWrappedSplit() is providing the first
> split instead of the current one.
>
> Checking the code confirms the suspicion:
>
> PigSplit.java:
>
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[0];
>     }
>
> Should be:
>     public InputSplit getWrappedSplit() {
>         return wrappedSplits[splitIndex];
>     }
>
>
> The side effect is that if you are trying to retrieve the current split
> when Pig is using CombinedInputFormat, it incorrectly always returns the
> first file in the list instead of the current one that it's reading. I have
> also confirmed it by outputting a log statement in prepareToRead():
>
>     @Override
>     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader, PigSplit split)
>             throws IOException {
>         String path = ((FileSplit) split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
>         partitions = getPartitions(table, path);
>         log.info("Preparing to read: " + path);
>         this.reader = reader;
>     }
>
> 2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
> 2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
> 2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
> 2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> 2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
> 2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
>
>
> Notice how Pig is correctly reporting the current split, but my "info"
> statement is always reporting the first input split instead of the current one.
>
> Bug? Jira? Patch?
>
> Thanks
> Alex R
>
