Technically, the block locations are provided by the InputSplit, which in the FileInputFormat case is obtained through the FileSystem interface.
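The key point is the indirection: split computation depends only on the abstract FileSystem contract, and the concrete class is chosen at runtime by name. A minimal self-contained sketch of that pattern (the class names here are made up for illustration; this is not Hadoop's actual API):

```java
import java.util.List;

// Toy stand-in for the abstract FileSystem contract: callers only see
// "block locations for a byte range", never a concrete implementation.
abstract class ToyFileSystem {
    abstract List<String> getBlockLocations(String file, long start, long len);

    // Pick the implementation at runtime by class name, as a stand-in for
    // Hadoop resolving the configured filesystem to a subclass via reflection.
    static ToyFileSystem get(String implClassName) throws Exception {
        return (ToyFileSystem) Class.forName(implClassName)
                .getDeclaredConstructor().newInstance();
    }
}

class ToyLocalFileSystem extends ToyFileSystem {
    @Override
    List<String> getBlockLocations(String file, long start, long len) {
        // A local filesystem has no remote replicas; everything is local.
        return List.of("localhost");
    }
}

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        // The split-computing code never names the concrete class directly.
        ToyFileSystem fs = ToyFileSystem.get("ToyLocalFileSystem");
        System.out.println(fs.getBlockLocations("/data/input.txt", 0, 128));
    }
}
```

Swap in a different class name and the split-computing code is unchanged, which is exactly why it does not need to know whether the locations come from a NameNode or anywhere else.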
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

The thing to realize here is that the FileSystem implementation is provided at runtime, so the InputSplit class is responsible for creating a FileSystem implementation using reflection, and then calling getBlockLocations on the file or set of files the input split corresponds to.

I think your confusion lies in the fact that the input splits create a filesystem; however, they don't know what the filesystem implementation actually is. They rely only on the abstract contract, which provides a set of block locations. See the FileSystem abstract class for details on that.

On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian <[email protected]> wrote:

> For the JobClient to compute the input splits, doesn't it need to contact
> the Name Node? Only the Name Node knows where the splits are; how can it
> compute them without that additional call?
>
> On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal <[email protected]> wrote:
>
>> The input splits are not copied; only the information on the location of
>> the splits is copied to the jobtracker, so that it can assign tasktrackers
>> which are local to the split.
>>
>> Check the Job Initialization section at
>> http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>>
>> "To create the list of tasks to run, the job scheduler first retrieves
>> the input splits computed by the JobClient from the shared filesystem
>> (step 6). It then creates one map task for each split. The number of
>> reduce tasks to create is determined by the mapred.reduce.tasks property
>> in the JobConf, which is set by the setNumReduceTasks() method, and the
>> scheduler simply creates this number of reduce tasks to be run. Tasks
>> are given IDs at this point."
>>
>> Best Regards,
>> Sonal
>> Nube Technologies <http://www.nubetech.co>
>> <http://in.linkedin.com/in/sonalgoyal>
>>
>> On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <[email protected]> wrote:
>>
>>> Hi,
>>> I have attached the anatomy of MR from the definitive guide.
>>>
>>> In step 6 it says the JT/Scheduler retrieves the input splits computed
>>> by the client from HDFS.
>>>
>>> The above line says that the client computes the input splits.
>>>
>>> 1. Why does the JT/Scheduler retrieve the input splits, and what does
>>> it do with them? If it is retrieving the input split, does this mean it
>>> goes to the block, reads each record, and gets the record back to the
>>> JT? If so, this is a lot of data movement for large files, which is not
>>> data locality, so I am getting confused.
>>>
>>> 2. How does the client know how to calculate the input splits?
>>>
>>> Any help please.
>>> Thanks,
>>> Sai

--
Jay Vyas
http://jayunit100.blogspot.com
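To make the point in the thread concrete: an input split is only a small description of a byte range plus its preferred hosts, so shipping splits to the jobtracker moves a few bytes of metadata, never the file contents. A toy sketch (the class and host names are hypothetical; this is not Hadoop's actual FileSplit):

```java
import java.util.Arrays;

// Toy illustration: a split describes where data lives; it contains no data.
class ToySplit {
    final String path;
    final long offset;
    final long length;
    final String[] hosts;   // nodes holding replicas of the underlying block

    ToySplit(String path, long offset, long length, String[] hosts) {
        this.path = path;
        this.offset = offset;
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    public String toString() {
        return path + ":" + offset + "+" + length + " @ " + Arrays.toString(hosts);
    }
}

public class SplitMetadataDemo {
    public static void main(String[] args) {
        // Three 128 MB splits of a 384 MB file: the scheduler receives only
        // these descriptions and uses the host lists for locality-aware
        // task assignment.
        long blockSize = 128L * 1024 * 1024;
        for (int i = 0; i < 3; i++) {
            ToySplit s = new ToySplit("/data/big.log", i * blockSize, blockSize,
                    new String[]{"node" + (i + 1), "node" + (i + 2)});
            System.out.println(s);
        }
    }
}
```

One map task is then created per split, and the scheduler prefers tasktrackers whose host appears in the split's host list, which is why no bulk data has to move at job-submission time.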
