This is a common concern; there is a MapReduce JIRA open for exactly this:
https://issues.apache.org/jira/browse/MAPREDUCE-2076

One way I use to find which inputs went to a map task is as follows:
a) Get the input split locations from the task log.
b) Go to that location and grep the datanode logs for the attempt id; you 
will get the block id from it.
c) On the input path of the MR job, run:
hadoop fsck <input_path> -locations -blocks -files 
The output contains the block report; search it for the block id to 
get the filename.

(fsck is a fairly expensive operation for the namenode, so be careful about 
the path you run fsck against.)
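A rough sketch of steps (b) and (c); the log file, attempt id, and block id below are made-up examples for illustration, so substitute your own values and datanode log path:

```shell
# Hypothetical attempt id from the failed map task (substitute your own).
ATTEMPT=attempt_201102141346_0097_m_000000_0

# (b) Extract the block id the attempt read from the datanode log.
# A typical datanode log line mentions both the block and the reading client;
# here we fake one line for demonstration instead of reading a real log:
echo "2011-02-16 10:00:01 INFO DataNode: served blk_-123456789 to $ATTEMPT" \
  > sample-datanode.log
BLK=$(grep "$ATTEMPT" sample-datanode.log | grep -o 'blk_[0-9-]*' | head -1)
echo "$BLK"   # -> blk_-123456789

# (c) Run fsck on the input path and search the report for that block id;
# the owning filename appears just above the block entry in the report:
#   hadoop fsck /path/to/input -files -blocks -locations > fsck-report.txt
#   grep -B 5 "$BLK" fsck-report.txt
```

The fsck commands are commented out above since they need a live cluster; on a real run, the `-B 5` context lines of the grep show the file path the block belongs to.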

Would like to know if anyone else has a better way.

Thanks and Regards
 Vivek 


-----Original Message-----
From: Kester, Scott [mailto:[email protected]] 
Sent: Wednesday, February 16, 2011 8:22 PM
To: [email protected]
Subject: How to find input file associated with failed map task?

This may be better asked on one of the other hadoop lists, but as the job in 
question is done with Pig I thought I would start here.  I have a nightly job 
that runs against around 1000 gzip log files.  Around once a week one of the 
map tasks will fail reporting some form of gzip error/corruption of the input 
file. The job still completes successfully, as we have set 
mapred.max.map.failures.percent = 1 to allow a few input files to fail without 
aborting the entire job.


 Sometimes I can find the name of the corrupt input file in the logs available 
for the map task from the Map/Reduce Administration page on port 50030 of the 
name node.  However, most of the time the name is not in these logs.  I can find 
the map task id of the form attempt_201102141346_0097_m_000000_0, but would 
like to know how, if possible, to find the name of the corrupted input file.  Is 
there a Pig/Hadoop file/log somewhere that associates the attempt id with the 
input file?

Thanks,
Scott
