Context: I have a bunch of files living in HDFS, and I think my job is failing on one of them. I want to log which file(s) the job is failing on.
I thought I could make my own LoadFunc that followed the same methodology as PigStorage, but caught exceptions and logged the file it was given. This isn't working, however. I tried returning loadLocation, but that is the globbed input, not the input to the mapper. I also tried reading mapreduce.map.file.input and map.file.input from the Job passed to setLocation, but both were null. I think this is where some of my ignorance of Pig's internals is coming into play: I'm not sure at what point the glob is expanded into individual files and the splits are actually read. I tried getLocations() on the PigSplit passed to prepareToRead, but that was just the glob as well.

My next thought is to make a RecordReader that outputs the file associated with each of its splits (since I assume it has to know the specific files it is processing), but I thought I'd ask if there was a cleaner way before doing that.

Thanks!
Jon
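For reference, here's roughly the shape of what I'm attempting (an untested sketch, not a working solution). The one piece I'm unsure about is where the real per-mapper path comes from: I'm assuming here that PigSplit has a getWrappedSplit() method returning the underlying Hadoop split, which may or may not hold in the Pig version I'm on.

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

/**
 * Sketch: a PigStorage subclass that remembers the path of the split it is
 * reading and reports it when a record fails to parse.
 */
public class PathLoggingStorage extends PigStorage {
    private static final Log LOG = LogFactory.getLog(PathLoggingStorage.class);
    private String currentPath = "(unknown)";

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        // ASSUMPTION: getWrappedSplit() exists on PigSplit and returns the
        // concrete FileSplit for this mapper rather than the globbed
        // location -- this is exactly the part I haven't verified.
        if (split.getWrappedSplit() instanceof FileSplit) {
            currentPath =
                ((FileSplit) split.getWrappedSplit()).getPath().toString();
        }
        super.prepareToRead(reader, split);
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            return super.getNext();
        } catch (Exception e) {
            // Surface the file we were reading when the failure happened,
            // then rethrow so the task still fails visibly.
            LOG.error("Failure while reading " + currentPath, e);
            throw new IOException("Failed on input: " + currentPath, e);
        }
    }
}
```

Usage would just be `A = LOAD '/path/glob*' USING PathLoggingStorage();`, with the failing file showing up in the task logs.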
