Hello.
FYI.
"The way HDFS has been set up, it breaks down very large files into large
blocks(for example, measuring 128MB), and stores three copies of these blocks
ondifferent nodes in the cluster. HDFS has no awareness of the content of
thesefiles. In YARN, when a MapReduce job is started, the Resource Manager
(thecluster resource management and job scheduling facility) creates
anApplication Master daemon to look after the lifecycle of the job. (In Hadoop
1,the JobTracker monitored individual jobs as well as handling job
schedulingand cluster resource management. One of the first things the
Application Masterdoes is determine which file blocks are needed for
processing. The Application Master requests details from the NameNode on where
the replicas of the needed data blocks are stored. Using the location data for
the file blocks, the Application Master makes requests to the Resource Manager
to have map tasks process specific blocks on the slave nodes where they’re
stored. The key to efficient MapReduce processing is that, wherever possible,
data isprocessed locally — on the slave node where it’s stored.Before looking
at how the data blocks are processed, you need to look moreclosely at how
Hadoop stores data. In Hadoop, files are composed of individualrecords, which
are ultimately processed one-by-one by mapper tasks. Forexample, the sample
data set we use in this book contains information aboutcompleted flights within
the United States between 1987 and 2008. We have onelarge file for each year,
and within every file, each individual line represents asingle flight. In other
words, one line represents one record. Now, rememberthat the block size for the
Hadoop cluster is 64MB, which means that the lightdata files are broken into
chunks of exactly 64MB.
Do you see the problem? If each map task processes all records in a specific
data block, what happens to those records that span block boundaries? File
blocks are exactly 64MB (or whatever you set the block size to be), and
because HDFS has no conception of what’s inside the file blocks, it can’t
gauge when a record might spill over into another block. To solve this
problem, Hadoop uses a logical representation of the data stored in file
blocks, known as input splits. When a MapReduce job client calculates the
input splits, it figures out where the first whole record in a block begins
and where the last record in the block ends. In cases where the last record in
a block is incomplete, the input split includes location information for the
next block and the byte offset of the data needed to complete the record.

You can configure the Application Master daemon (or JobTracker, if you’re in
Hadoop 1) to calculate the input splits instead of the job client, which would
be faster for jobs processing a large number of data blocks.

MapReduce data processing is driven by this concept of input splits. The
number of input splits that are calculated for a specific application
determines the number of mapper tasks. Each of these mapper tasks is assigned,
where possible, to a slave node where the input split is stored. The Resource
Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that
input splits are processed locally."
sic
Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and
Roman B. Melnyk
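
To tie this back to the original question: with the stock FileInputFormat, one
input split normally covers exactly one block, but the split size is
configurable rather than random. Below is a minimal sketch (not from the book;
the class name SplitSizeDemo, the identity map/reduce, and the command-line
paths are just placeholders) of a Hadoop 2 driver that prints the block
locations the NameNode reports for an input file and then nudges the split
size with setMinInputSplitSize / setMaxInputSplitSize.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);   // placeholder input file
        Path output = new Path(args[1]);  // placeholder output dir

        // Ask the NameNode where the blocks of the input file live; this is
        // the same kind of lookup the Application Master does before it asks
        // the Resource Manager to schedule map tasks near the data.
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(input);
        for (BlockLocation loc :
            fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("offset " + loc.getOffset()
              + " length " + loc.getLength()
              + " hosts " + String.join(",", loc.getHosts()));
        }

        // Identity map/reduce is used here just to keep the sketch
        // self-contained; only the split-size settings matter.
        Job job = Job.getInstance(conf, "split size demo");
        job.setJarByClass(SplitSizeDemo.class);

        // Raising the minimum split size lets one split span more than one
        // block (fewer, larger map tasks), e.g. two 128MB blocks:
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // Lowering the maximum carves each block into several splits
        // (more, smaller map tasks):
        // FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework computes the split size as roughly
max(minSize, min(maxSize, blockSize)), so with the defaults the number of
splits (and therefore mapper tasks) equals the number of blocks in the input.
Raising mapreduce.input.fileinputformat.split.minsize (mapred.min.split.size
in Hadoop 1) lets one split span several blocks, while lowering the
corresponding maxsize setting spreads a single block across several mappers.
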
Mark Charts
On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte
<[email protected]> wrote:
Hi,
Check this post:
http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
Regards, D
2014-12-17 15:16 GMT+01:00 Todd <[email protected]>:
Hi Hadoopers,
I've got a question: how many blocks does one input split have? Is it random,
can the number be configured, or is it fixed (i.e., can't be changed)?
Thanks!