Technically, the block locations are provided by the InputSplit, which in the FileInputFormat case is obtained through the FileSystem interface.
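The key point is the indirection: split computation depends only on the abstract FileSystem contract, and the concrete class is chosen at runtime by name. A minimal self-contained sketch of that pattern (the class names here are made up for illustration; this is not Hadoop's actual API):

```java
import java.util.List;

// Toy stand-in for the abstract FileSystem contract: callers only see
// "block locations for a byte range", never a concrete implementation.
abstract class ToyFileSystem {
    abstract List<String> getBlockLocations(String file, long start, long len);

    // Pick the implementation at runtime by class name, as a stand-in for
    // Hadoop resolving the configured filesystem to a subclass via reflection.
    static ToyFileSystem get(String implClassName) throws Exception {
        return (ToyFileSystem) Class.forName(implClassName)
                .getDeclaredConstructor().newInstance();
    }
}

class ToyLocalFileSystem extends ToyFileSystem {
    @Override
    List<String> getBlockLocations(String file, long start, long len) {
        // A local filesystem has no remote replicas; everything is local.
        return List.of("localhost");
    }
}

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        // The split-computing code never names the concrete class directly.
        ToyFileSystem fs = ToyFileSystem.get("ToyLocalFileSystem");
        System.out.println(fs.getBlockLocations("/data/input.txt", 0, 128));
    }
}
```

Swap in a different class name and the split-computing code is unchanged, which is exactly why it does not need to know whether the locations come from a NameNode or anywhere else.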
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

The thing to realize here is that the FileSystem implementation is provided at runtime, so the InputSplit class is responsible for creating a FileSystem implementation using reflection, and then calling getBlockLocations on the file or set of files the input split corresponds to.

I think your confusion lies in the fact that the input splits create a filesystem; however, they don't know what the filesystem implementation actually is. They rely only on the abstract contract, which provides a set of block locations. See the FileSystem abstract class for details on that.

On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian <[email protected]> wrote:

> For the JobClient to compute the input splits, doesn't it need to contact
> the Name Node? Only the Name Node knows where the splits are; how can it
> compute them without that additional call?
>
> On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal <[email protected]> wrote:
>
>> The input splits are not copied; only the information on the location of
>> the splits is copied to the jobtracker, so that it can assign tasktrackers
>> which are local to the split.
>>
>> Check the Job Initialization section at
>> http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>>
>> "To create the list of tasks to run, the job scheduler first retrieves
>> the input splits computed by the JobClient from the shared filesystem
>> (step 6). It then creates one map task for each split. The number of
>> reduce tasks to create is determined by the mapred.reduce.tasks property
>> in the JobConf, which is set by the setNumReduceTasks() method, and the
>> scheduler simply creates this number of reduce tasks to be run. Tasks
>> are given IDs at this point."
>>
>> Best Regards,
>> Sonal
>> Nube Technologies <http://www.nubetech.co>
>> <http://in.linkedin.com/in/sonalgoyal>
>>
>> On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <[email protected]> wrote:
>>
>>> Hi,
>>> I have attached the anatomy of MR from the definitive guide.
>>>
>>> In step 6 it says the JT/Scheduler retrieves the input splits computed
>>> by the client from HDFS.
>>>
>>> The above line says that the client computes the input splits.
>>>
>>> 1. Why does the JT/Scheduler retrieve the input splits, and what does
>>> it do with them? If it is retrieving the input split, does this mean it
>>> goes to the block, reads each record, and gets the record back to the
>>> JT? If so, this is a lot of data movement for large files, which is not
>>> data locality, so I am getting confused.
>>>
>>> 2. How does the client know how to calculate the input splits?
>>>
>>> Any help please.
>>> Thanks,
>>> Sai

--
Jay Vyas
http://jayunit100.blogspot.com
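To make the point in the thread concrete: an input split is only a small description of a byte range plus its preferred hosts, so shipping splits to the jobtracker moves a few bytes of metadata, never the file contents. A toy sketch (the class and host names are hypothetical; this is not Hadoop's actual FileSplit):

```java
import java.util.Arrays;

// Toy illustration: a split describes where data lives; it contains no data.
class ToySplit {
    final String path;
    final long offset;
    final long length;
    final String[] hosts;   // nodes holding replicas of the underlying block

    ToySplit(String path, long offset, long length, String[] hosts) {
        this.path = path;
        this.offset = offset;
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    public String toString() {
        return path + ":" + offset + "+" + length + " @ " + Arrays.toString(hosts);
    }
}

public class SplitMetadataDemo {
    public static void main(String[] args) {
        // Three 128 MB splits of a 384 MB file: the scheduler receives only
        // these descriptions and uses the host lists for locality-aware
        // task assignment.
        long blockSize = 128L * 1024 * 1024;
        for (int i = 0; i < 3; i++) {
            ToySplit s = new ToySplit("/data/big.log", i * blockSize, blockSize,
                    new String[]{"node" + (i + 1), "node" + (i + 2)});
            System.out.println(s);
        }
    }
}
```

One map task is then created per split, and the scheduler prefers tasktrackers whose host appears in the split's host list, which is why no bulk data has to move at job-submission time.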
