Ran into this today, using trunk (0.11).
If you are using a custom loader and try to get input split
information in prepareToRead(), getWrappedSplit() returns the first
split instead of the current one.
Checking the code confirms the suspicion:
PigSplit.java:

    public InputSplit getWrappedSplit() {
        return wrappedSplits[0];
    }

Should be:

    public InputSplit getWrappedSplit() {
        return wrappedSplits[splitIndex];
    }
The side effect is that if you try to retrieve the current split while
Pig is using CombinedInputFormat, it incorrectly always returns the
first file in the list instead of the one it is actually reading. I have
also confirmed this by outputting a log statement in prepareToRead():
    @Override
    public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader,
            PigSplit split) throws IOException {
        String path =
            ((FileSplit) split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
        partitions = getPartitions(table, path);
        log.info("Preparing to read: " + path);
        this.reader = reader;
    }
2012-01-06 16:27:24,165 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005:0+6187085
2012-01-06 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2012-01-06 16:27:24,183 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2dd49ec41018ba4141b20edf28dbb43c0c07f373]
2012-01-06 16:27:24,189 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
2012-01-06 16:27:28,053 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: Current split being processed hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00006:0+6181475
2012-01-06 16:27:28,056 INFO com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to read: hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-00005
Notice how PigRecordReader correctly reports the current split, but my
"info" statement always reports the first input split instead of the
current one.
Bug? Jira? Patch?
Thanks
Alex R