Hello.

FYI.
"The way HDFS has been set up, it breaks down very large files into large 
blocks (for example, measuring 128MB), and stores three copies of these blocks 
on different nodes in the cluster. HDFS has no awareness of the content of 
these files. In YARN, when a MapReduce job is started, the Resource Manager 
(the cluster resource management and job scheduling facility) creates 
an Application Master daemon to look after the lifecycle of the job. (In Hadoop 
1, the JobTracker monitored individual jobs as well as handling job 
scheduling and cluster resource management.) One of the first things the 
Application Master does is determine which file blocks are needed for 
processing. The Application Master requests details from the NameNode on where 
the replicas of the needed data blocks are stored. Using the location data for 
the file blocks, the Application Master makes requests to the Resource Manager 
to have map tasks process specific blocks on the slave nodes where they’re 
stored. The key to efficient MapReduce processing is that, wherever possible, 
data is processed locally — on the slave node where it’s stored.

Before looking at how the data blocks are processed, you need to look more 
closely at how Hadoop stores data. In Hadoop, files are composed of individual 
records, which are ultimately processed one-by-one by mapper tasks. For 
example, the sample data set we use in this book contains information about 
completed flights within the United States between 1987 and 2008. We have one 
large file for each year, and within every file, each individual line 
represents a single flight. In other words, one line represents one record. 
Now, remember that the block size for the Hadoop cluster is 64MB, which means 
that the flight data files are broken into chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a specific 
data block, what happens to those records that span block boundaries? File 
blocks are exactly 64MB (or whatever you set the block size to be), and 
because HDFS has no conception of what’s inside the file blocks, it can’t 
gauge when a record might spill over into another block. To solve this 
problem, Hadoop uses a logical representation of the data stored in file 
blocks, known as input splits. When a MapReduce job client calculates the 
input splits, it figures out where the first whole record in a block begins 
and where the last record in the block ends. In cases where the last record in 
a block is incomplete, the input split includes location information for the 
next block and the byte offset of the data needed to complete the record.

You can configure the Application Master daemon (or JobTracker, if you’re in 
Hadoop 1) to calculate the input splits instead of the job client, which would 
be faster for jobs processing a large number of data blocks.

MapReduce data processing is driven by this concept of input splits. The 
number of input splits that are calculated for a specific application 
determines the number of mapper tasks. Each of these mapper tasks is assigned, 
where possible, to a slave node where the input split is stored. The Resource 
Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that 
input splits are processed locally."
                                    [sic]
Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and 
Roman B. Melnyk
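
The "records that span block boundaries" part is what a line-oriented record 
reader has to cope with: a reader whose split does not start at offset 0 skips 
the partial first line (the previous split's reader owns it), and every reader 
keeps going past its end offset until it finishes the record it already 
started. A rough Python sketch of that convention — illustrative only, not the 
actual Hadoop LineRecordReader code:

```python
def read_split(data: bytes, start: int, end: int):
    """Yield the complete line-records owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Not at the start of the file: the partial line we landed in
        # belongs to the previous split's reader, so skip past it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            # Last record in the file has no trailing newline.
            yield data[pos:]
            return
        # Note: nl may be past `end` — we finish the record we started.
        yield data[pos:nl]
        pos = nl + 1

data = b"flight1\nflight2\nflight3\n"
# Pretend the block boundary falls at byte 10, mid-way through "flight2".
first = list(read_split(data, 0, 10))            # reads past byte 10
second = list(read_split(data, 10, len(data)))   # skips into "flight3"
print(first)   # [b'flight1', b'flight2']
print(second)  # [b'flight3']
```

So every record is read exactly once, even though the byte boundary falls in 
the middle of one — which is the whole point of splits being logical rather 
than physical.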
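
On Todd's question below (how many blocks per split): it's neither random nor 
fixed. With FileInputFormat the split size is derived from the block size, 
clamped by the configurable minimum and maximum split sizes 
(mapreduce.input.fileinputformat.split.minsize / split.maxsize), so by default 
one split covers one block, but you can make a split cover several blocks. A 
minimal sketch of that computation (function names here are mine, not the 
Hadoop API):

```python
import math

def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Same shape as Hadoop's computeSplitSize:
    # splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def split_count(file_length, split_size):
    # Number of input splits for one file; the last split may be smaller.
    return math.ceil(file_length / split_size)

MB = 1024 * 1024
# Defaults: one split per block — a 1GB file with 128MB blocks gives 8 splits.
print(split_count(1024 * MB, compute_split_size(128 * MB)))                    # 8
# Raise the minimum split size to 256MB: two blocks per split, 4 splits.
print(split_count(1024 * MB, compute_split_size(128 * MB, min_size=256 * MB))) # 4
```

The number of mapper tasks follows directly from the split count, which is why 
tuning min/max split size is a common way to control mapper parallelism.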


Mark Charts

 

     On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte 
<[email protected]> wrote:
   

 Hi,

Check this post: 
http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D


2014-12-17 15:16 GMT+01:00 Todd <[email protected]>:
Hi Hadoopers,

I have a question about how many blocks one input split contains. Is the 
number random, or can it be configured, or is it fixed (can't be changed)?
Thanks!



   
