Sounds like a bug. I guess no one has ever relied on specific split info before. Please open a Jira.
Daniel

On Fri, Jan 6, 2012 at 10:21 PM, Alex Rovner <[email protected]> wrote:
> Additionally, it looks like PigRecordReader is not incrementing the index in
> the PigSplit when dealing with CombinedInputFormat, so the index will be
> incorrect in either case.
>
> On Fri, Jan 6, 2012 at 4:50 PM, Alex Rovner <[email protected]> wrote:
>
> > Ran into this today. Using trunk (0.11).
> >
> > If you are using a custom loader and are trying to get input split
> > information in prepareToRead(), getWrappedSplit() returns the first
> > split instead of the current one.
> >
> > Checking the code confirms the suspicion:
> >
> > PigSplit.java:
> >
> >     public InputSplit getWrappedSplit() {
> >         return wrappedSplits[0];
> >     }
> >
> > Should be:
> >
> >     public InputSplit getWrappedSplit() {
> >         return wrappedSplits[splitIndex];
> >     }
> >
> > The side effect is that if you try to retrieve the current split while
> > Pig is using CombinedInputFormat, it always returns the first file in
> > the list instead of the one it is currently reading. I have also
> > confirmed this by writing a log statement in prepareToRead():
> >
> >     @Override
> >     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader, PigSplit split)
> >             throws IOException {
> >         String path =
> >             ((FileSplit)split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
> >         partitions = getPartitions(table, path);
> >         log.info("Preparing to read: " + path);
> >         this.reader = reader;
> >     }
> >
> > 2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
> > 2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
> > 2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
> > 2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> > 2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
> > 2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
> >
> > Notice how Pig correctly reports the split, but my "info" statement
> > always reports the first input split instead of the current one.
> >
> > Bug? Jira? Patch?
> >
> > Thanks
> > Alex R
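
For anyone following the thread, here is a minimal, self-contained sketch of the behavior described above. It does not use Pig's real classes: CombinedSplit, getWrappedSplitFirst(), getWrappedSplitCurrent() and readAll() are hypothetical stand-ins, and the file names are shortened from the log excerpt. It only illustrates why indexing with 0 always yields the first file, and why the reader also has to advance the index for the proposed change to help.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitIndexSketch {

    /** Hypothetical stand-in for a combined split; this is NOT Pig's PigSplit. */
    static class CombinedSplit {
        private final String[] wrappedSplits;
        private int splitIndex = 0;

        CombinedSplit(String... wrappedSplits) {
            this.wrappedSplits = wrappedSplits;
        }

        void setSplitIndex(int i) { splitIndex = i; }
        int size() { return wrappedSplits.length; }

        // Mirrors the reported behavior: always the first wrapped split.
        String getWrappedSplitFirst() { return wrappedSplits[0]; }

        // Mirrors the proposed change: the split at the current index.
        String getWrappedSplitCurrent() { return wrappedSplits[splitIndex]; }
    }

    /**
     * Simulates a reader walking over every wrapped split. The index is
     * advanced before each read, which is the step the follow-up mail says
     * is missing when splits are combined.
     */
    static List<String> readAll(CombinedSplit split, boolean useCurrent) {
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < split.size(); i++) {
            split.setSplitIndex(i);
            seen.add(useCurrent ? split.getWrappedSplitCurrent()
                                : split.getWrappedSplitFirst());
        }
        return seen;
    }

    public static void main(String[] args) {
        CombinedSplit split = new CombinedSplit("150-r-00005", "150-r-00006");
        System.out.println(readAll(split, false)); // [150-r-00005, 150-r-00005]
        System.out.println(readAll(split, true));  // [150-r-00005, 150-r-00006]
    }
}
```

With the index-based accessor and an index that is actually advanced, the second read reports 150-r-00006, which matches what PigRecordReader logs in the excerpt above.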
