Thanks, that's got it. I thought I'd disabled this, but I was setting "splitCombination" instead of "pig.splitCombination".
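For anyone who finds this thread later: the property goes in the Pig script itself, and the name needs the "pig." prefix. Combining the suggestion below with the script from the original message (paths are just the ones from that example), the working version would look like:

```pig
-- Keep Pig from combining small input files into one split; with
-- combining on, the wrapped split only ever reported the first file.
set pig.splitCombination false;

register /tmp/udf.jar;
a = load '/test/*' using com.test.PigStorageWithInputPath();
dump a;
```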
Cheers!

On Fri, Jul 15, 2011 at 4:53 PM, Xiaomeng Wan <[email protected]> wrote:
> try "set pig.splitCombination false" in your pig script
>
> Shawn
>
> On Fri, Jul 15, 2011 at 1:42 PM, CJ Niemira <[email protected]> wrote:
> > I'm attempting to get the PigStorageWithInputPath example
> > (http://wiki.apache.org/pig/PigStorageWithInputPath) working, but I must
> > be missing something. It works fine if I specify a single file, but if I
> > use a glob in my load command all of the records end up with the first
> > filename only.
> >
> > My loadfunc class:
> >
> > package com.test;
> >
> > import java.io.IOException;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.mapreduce.RecordReader;
> > import org.apache.hadoop.mapreduce.lib.input.FileSplit;
> > import org.apache.log4j.Logger;
> > import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
> > import org.apache.pig.builtin.PigStorage;
> > import org.apache.pig.data.Tuple;
> >
> > public class PigStorageWithInputPath extends PigStorage {
> >     static Logger logger = Logger.getLogger(PigStorageWithInputPath.class);
> >     Path path = null;
> >
> >     @Override
> >     public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader,
> >             PigSplit split) {
> >         super.prepareToRead(reader, split);
> >         path = ((FileSplit) split.getWrappedSplit()).getPath();
> >         logger.info("path @prepareToRead=" + path);
> >     }
> >
> >     @Override
> >     public Tuple getNext() throws IOException {
> >         Tuple myTuple = super.getNext();
> >         if (myTuple != null)
> >             myTuple.append(path.toString());
> >         return myTuple;
> >     }
> > }
> >
> > My test data is three files. Each file has three lines in the format
> > "file n\tline n".
> >
> > -bash-3.2$ hadoop fs -ls /test
> > Found 3 items
> > -rw-r--r--   1 test supergroup   42 2011-07-15 15:00 /test/file1
> > -rw-r--r--   1 test supergroup   42 2011-07-15 15:00 /test/file2
> > -rw-r--r--   1 test supergroup   42 2011-07-15 15:00 /test/file3
> >
> > My Pig script:
> >
> > register /tmp/udf.jar;
> > a = load '/test/*' using com.test.PigStorageWithInputPath();
> > dump a;
> >
> > The output:
> >
> > (file 1,line 1,hdfs://localhost/test/file1)
> > (file 1,line 2,hdfs://localhost/test/file1)
> > (file 1,line 3,hdfs://localhost/test/file1)
> > (file 2,line 1,hdfs://localhost/test/file1)
> > (file 2,line 2,hdfs://localhost/test/file1)
> > (file 2,line 3,hdfs://localhost/test/file1)
> > (file 3,line 1,hdfs://localhost/test/file1)
> > (file 3,line 2,hdfs://localhost/test/file1)
> > (file 3,line 3,hdfs://localhost/test/file1)
> >
> > The last item in each line is "file1," even though the contents clearly
> > came from other files.
> > What I'm expecting to see is this:
> >
> > (file 1,line 1,hdfs://localhost/test/file1)
> > (file 1,line 2,hdfs://localhost/test/file1)
> > (file 1,line 3,hdfs://localhost/test/file1)
> > (file 2,line 1,hdfs://localhost/test/file2)
> > (file 2,line 2,hdfs://localhost/test/file2)
> > (file 2,line 3,hdfs://localhost/test/file2)
> > (file 3,line 1,hdfs://localhost/test/file3)
> > (file 3,line 2,hdfs://localhost/test/file3)
> > (file 3,line 3,hdfs://localhost/test/file3)
> >
> > And the task attempt log:
> >
> > 2011-07-15 15:00:58,933 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> > 2011-07-15 15:00:59,329 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> > 2011-07-15 15:00:59,720 INFO com.test.PigStorageWithInputPath: path @prepareToRead=hdfs://localhost/test/file1
> > 2011-07-15 15:00:59,769 INFO com.test.PigStorageWithInputPath: path @prepareToRead=hdfs://localhost/test/file1
> > 2011-07-15 15:00:59,771 INFO com.test.PigStorageWithInputPath: path @prepareToRead=hdfs://localhost/test/file1
> > 2011-07-15 15:00:59,830 INFO org.apache.hadoop.mapred.Task: Task:attempt_201107111628_0027_m_000000_0 is done. And is in the process of commiting
> > 2011-07-15 15:01:00,851 INFO org.apache.hadoop.mapred.Task: Task attempt_201107111628_0027_m_000000_0 is allowed to commit now
> > 2011-07-15 15:01:00,861 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201107111628_0027_m_000000_0' to hdfs://localhost/tmp/temp-2143171191/tmp-286804436
> > 2011-07-15 15:01:00,875 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201107111628_0027_m_000000_0' done.
> > 2011-07-15 15:01:00,897 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> >
> > For the record, I'm doing this test against a CDH3U0 installation
> > (pig-0.8 + hadoop-0.20.2) in pseudo-distributed mode.
> >
> > Am I doing something wrong?
