Hello, I would like to know the order in which input files are processed by the map tasks of a given job.
*Example*: I am running WordCount on an input directory /ebooks/ containing, say, 10 .txt files. While the job runs, I would like to know, at any point in time, which map tasks (map task IDs) on which nodes (IP addresses) are processing which file splits (actual file, range of offsets).

Is it possible to hook into the MR source code to obtain these details? Please point me to the section of code where I can get them.

By logging and analyzing these details, I might want to perform some pre-fetching to improve map task performance. (I am not using HDFS, but a different FS that needs some performance improvement through pre-fetching or other techniques.)

TL;DR: I want to know the sequence/order in which different files will be accessed by map tasks once a job is submitted to a Hadoop v2 cluster. I am assuming some kind of FIFO scheduler module might be able to give me this information at the file level?

Looking forward to your reply. Thanks.
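P.S. To make the goal concrete, here is a minimal sketch in plain Java (no Hadoop dependencies) of the per-split record I would like to end up with, and of how I would derive the file access order from it. Everything here is my own assumption for illustration: the class name `SplitAccessLog`, the field names, the sample task IDs and IPs, and the tab-separated line format are hypothetical, not something Hadoop emits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper: models the log records I would like to collect per map task.
class SplitAccessLog {

    /** One assumed log entry: a map task on a node starts reading one file split. */
    static final class Entry {
        final long startMillis; // when the task started reading the split
        final String taskId;    // map task ID
        final String nodeIp;    // IP address of the node running the task
        final String file;      // input file the split belongs to
        final long offset;      // first byte of the split
        final long length;      // number of bytes in the split

        Entry(long startMillis, String taskId, String nodeIp,
              String file, long offset, long length) {
            this.startMillis = startMillis;
            this.taskId = taskId;
            this.nodeIp = nodeIp;
            this.file = file;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Parses one line of the assumed format: time, task, ip, file, offset, length (tab-separated). */
    static Entry parse(String line) {
        String[] f = line.split("\t");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 tab-separated fields: " + line);
        }
        return new Entry(Long.parseLong(f[0]), f[1], f[2], f[3],
                Long.parseLong(f[4]), Long.parseLong(f[5]));
    }

    /** Returns the distinct files in the order in which they were first accessed. */
    static List<String> fileOrder(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.startMillis));
        LinkedHashSet<String> files = new LinkedHashSet<>(); // keeps first-seen order
        for (Entry e : sorted) {
            files.add(e.file);
        }
        return new ArrayList<>(files);
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        // Made-up sample records, deliberately out of chronological order.
        entries.add(parse("2000\tmap_000002\t10.0.0.1\t/ebooks/a.txt\t134217728\t1000"));
        entries.add(parse("1000\tmap_000000\t10.0.0.1\t/ebooks/a.txt\t0\t134217728"));
        entries.add(parse("1500\tmap_000001\t10.0.0.2\t/ebooks/b.txt\t0\t5000"));
        System.out.println(fileOrder(entries));
    }
}
```

If records like these could be produced from inside the framework, the resulting file order is what I would feed into a pre-fetcher for my file system.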
