This may be better asked on one of the other Hadoop lists, but since the job in question is run with Pig I thought I would start here. I have a nightly job that runs against around 1,000 gzipped log files. About once a week one of the map tasks fails, reporting some form of gzip error/corruption in its input file. The job still completes successfully because we have set mapred.max.map.failures.percent = 1, which allows a few input files to fail without aborting the entire job.
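For reference, that failure-tolerance property can be supplied either on the command line or inside the Pig script itself; a minimal sketch, assuming a 0.20-era cluster (the script name below is hypothetical):

```shell
# Allow up to 1% of map tasks to fail without aborting the job
# (property name as used in the post; script name is hypothetical):
pig -Dmapred.max.map.failures.percent=1 nightly_logs.pig

# Or, equivalently, at the top of the Pig script:
#   SET mapred.max.map.failures.percent 1;
```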
Sometimes I can find the name of the corrupt input file in the logs for the map task, available from the Map/Reduce Administration page on port 50030 of the name node, but most of the time the file name is not in those logs. I can find the map task attempt id, of the form attempt_201102141346_0097_m_000000_0, but I would like to know how, if possible, to get from that to the name of the corrupted input file. Is there a Pig/Hadoop file or log somewhere that associates the attempt id with its input file? Thanks, Scott
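One place worth checking is the attempt's syslog on the TaskTracker that ran it: Hadoop's MapTask logs a "Processing split: ..." line at INFO when the task starts, and for a plain FileSplit that line includes the input path. Whether Pig's split wrapper prints the underlying path depends on the version, so this is a sketch under that assumption; the log line and file name below are simulated for illustration (on a real node you would grep $HADOOP_LOG_DIR/userlogs/<attempt_id>/syslog instead):

```shell
# Sketch: recover the input split from a map attempt's syslog.
# Assumes 0.20-era task logs under a userlogs/<attempt_id>/ directory
# and that the "Processing split:" INFO line shows the file path.
ATTEMPT=attempt_201102141346_0097_m_000000_0

# Simulate a syslog for this example; the path and log line are made up.
mkdir -p "/tmp/userlogs/$ATTEMPT"
echo "2011-02-14 13:46:00,000 INFO org.apache.hadoop.mapred.MapTask: Processing split: hdfs://nn/logs/access-2011-02-13.gz:0+1048576" \
  > "/tmp/userlogs/$ATTEMPT/syslog"

# The actual lookup: pull the split line for the failed attempt.
grep "Processing split:" "/tmp/userlogs/$ATTEMPT/syslog"
```

If the split line only shows a wrapper class name rather than a path, the jobconf for the attempt (also kept alongside the task logs in that era) is the other place the input information may survive.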
