Context: I have a bunch of files living in HDFS, and I think my jobs are
failing on one of them... I want to log which file the job is failing on.

I thought I could just make my own LoadFunc that followed the same
methodology as PigStorage, but caught exceptions and logged the file it
was given... this isn't working, however. I tried returning loadLocation,
but that is the globbed input, not the input to the mapper. I also tried
reading mapreduce.map.file.input and map.file.input from the Job given to
setLocation, but both were null... I think this is where some of my
ignorance of Pig's internal workings is coming into play, as I'm not sure
when files are deglobbed and the splits are actually read. I tried using
getLocations() from the PigSplit passed to prepareToRead, but that was just
the glob as well...
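For concreteness, here's a sketch of the kind of LoadFunc I mean. The
getWrappedSplit() call is my guess at how to get past the glob (I believe
PigSplit wraps the concrete Hadoop split for the mapper); the class and
field names are just illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PathLoggingStorage extends PigStorage {

    // Path of the file this mapper is actually reading, if we can find it.
    private String currentFile = "unknown";

    @Override
    public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader,
                              PigSplit split) throws IOException {
        super.prepareToRead(reader, split);
        // getLocations() only gave me the glob; the hope is that the
        // wrapped split is the concrete FileSplit with the real path.
        if (split.getWrappedSplit() instanceof FileSplit) {
            currentFile = ((FileSplit) split.getWrappedSplit()).getPath().toString();
        }
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            return super.getNext();
        } catch (Exception e) {
            // Surface the offending file, then rethrow so the job still fails.
            System.err.println("Failure while reading " + currentFile);
            throw new IOException("Error reading " + currentFile, e);
        }
    }
}
```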

My next thought would be to make a RecordReader that outputs the file
associated with its split (since I assume it must know the specific file
it is processing), but I thought I'd ask if there is a cleaner way before
doing that...
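If it helps, this is roughly the RecordReader I'd write: a sketch that
delegates to Hadoop's LineRecordReader and just tags any read failure with
the split's path. The class name is illustrative, and I'm assuming the
split handed to initialize() is the concrete (deglobbed) FileSplit:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class PathLoggingRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private String path = "unknown";

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // By this point the split should be a concrete FileSplit,
        // so the actual path should be recoverable here.
        if (split instanceof FileSplit) {
            path = ((FileSplit) split).getPath().toString();
        }
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        try {
            return delegate.nextKeyValue();
        } catch (IOException e) {
            // Attach the offending file's path before rethrowing.
            throw new IOException("Failed reading split from " + path, e);
        }
    }

    @Override
    public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```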

Thanks!
Jon
