Daniel,

Thank you very much for your answer with a concrete example. It did solve our problem!
Thanks again,
Sang

On Tue, Nov 23, 2010 at 1:27 PM, Daniel Dai <[email protected]> wrote:
> I remember we did something similar before. FileSplit.getPath() does have a
> hold of the file name.
>
> Here is some sample code:
>
> public class PigStorageWithInputPath extends PigStorage {
>     Path path = null;
>
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split) {
>         super.prepareToRead(reader, split);
>         path = ((FileSplit) split.getWrappedSplit()).getPath();
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         Tuple myTuple = super.getNext();
>         if (myTuple != null)
>             myTuple.append(path.toString());
>         return myTuple;
>     }
> }
>
> Does it solve your problem?
>
> Daniel
>
> Sangchul Song wrote:
>>
>> Hi all,
>>
>> Our dataset consists of multiple files, and the name of each file
>> reflects its creation date (e.g. 20101031.dat, 20101101.dat, etc.).
>> We need this date information for all relations inside the file, but
>> there is no date field.
>>
>> We first considered accessing the file name through a UDF that extends
>> LoadFunc, but it doesn't appear to be possible. In particular,
>> 'location' in setLocation(String location, Job job) only gives the
>> original glob expression used in LOAD (such as '/test/data/*.dat'),
>> and 'reader' in prepareToRead(RecordReader reader, PigSplit split)
>> doesn't expose a method for file name access.
>>
>> Before we individually add the date field to every single file (which
>> we want to leave as a last resort, considering the number of files we
>> deal with), we were wondering if there is any way to access the file
>> name within a Pig script (including UDFs), especially when loading
>> multiple files at the same time. Any help would be greatly
>> appreciated.
>>
>> FYI, we are on Pig 0.7.0 running on top of Hadoop 0.20.2.
>>
>> Thanks,
>>
>> Sang
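For anyone who finds this thread later, here is the same loader written out as a
self-contained sketch, with the imports it needs and comments on what each piece
does. The package names are what I believe apply to Pig 0.7.0 on Hadoop 0.20.2's
new MapReduce API, so double-check them against your jars.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
    // Path of the file backing the split this task is currently reading.
    private Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        // PigSplit wraps the underlying Hadoop split; for ordinary file input it
        // is a FileSplit, which knows the full path of the file being read.
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        // Tack the source file path onto every tuple as an extra last field.
        if (myTuple != null) {
            myTuple.append(path.toString());
        }
        return myTuple;
    }
}

Usage would then be roughly to REGISTER the jar that contains the class and load
with something like "A = LOAD '/test/data/*.dat' USING PigStorageWithInputPath();".
The file path comes back as the last field of each tuple, so the date can be parsed
out of it downstream. Note that only the no-argument constructor exists here, so to
use a non-default delimiter you would also add a String constructor that passes the
delimiter through to PigStorage.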
