1 map task = 1 input split, but a Mapper class can handle multiple tasks, albeit one at a time.
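To put the original "random, configurable or fixed?" question in code form: split size is derived from the HDFS block size, bounded by two per-job settings, so with the defaults one split covers exactly one block and drives one map task. A minimal standalone sketch of that calculation (the math mirrors FileInputFormat.computeSplitSize; the 128 MB block and 1 TB file are just example numbers, not from this thread):

// SplitSizeSketch.java - illustrative only, not a drop-in Hadoop class
public class SplitSizeSketch {

    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize)
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // one HDFS block, 128 MB
        long minSize   = 1L;                   // default mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;       // default mapreduce.input.fileinputformat.split.maxsize
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        // With the defaults, splitSize == blockSize, so one split per block
        // and one map task per split.
        long fileSize  = 1024L * 1024 * 1024 * 1024;        // a 1 TB input file
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        System.out.println(splitSize + " bytes per split -> " + numSplits + " map tasks");
    }
}

For a 1 TB file and 128 MB blocks this prints 8192 map tasks, which is the "thousands of tasks, not all fired off at once" situation described further down the thread.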
2014-12-18 4:54 GMT+01:00 [email protected] <[email protected]>:

> Sure, thanks Mark. That means the completed mapper task is not reused to
> work on the pending input splits.
>
> ------------------------------
> [email protected]
>
> From: daemeon reiydelle <[email protected]>
> Date: 2014-12-18 11:11
> To: user <[email protected]>
> CC: mark charts <[email protected]>
> Subject: Re: Re: How many blocks does one input split have?
>
> There would be thousands of tasks, but not all fired off at the same time.
> The number of parallel tasks is configurable but typically 1 per data node
> core.
>
> .......
>
> On Wed, Dec 17, 2014 at 6:31 PM, [email protected] <[email protected]> wrote:
>>
>> Thanks Mark and Dieter for the reply.
>>
>> Actually, I have another question in mind. What's the relationship between
>> an input split and a mapper task? Is it a one-to-one relation, or can a
>> mapper task handle more than one input split?
>>
>> If a mapper task can only handle one input split, then if there are many
>> input splits (say, the original file is 1 TB or larger, so there may be
>> thousands of input splits), thousands of mapper tasks would be created.
>>
>> ------------------------------
>> [email protected]
>>
>> From: mark charts <[email protected]>
>> Date: 2014-12-18 00:15
>> To: [email protected]
>> Subject: Re: How many blocks does one input split have?
>>
>> Hello.
>>
>> FYI.
>>
>> "The way HDFS has been set up, it breaks down very large files into large
>> blocks (for example, measuring 128MB), and stores three copies of these
>> blocks on different nodes in the cluster. HDFS has no awareness of the
>> content of these files.
>>
>> In YARN, when a MapReduce job is started, the Resource Manager (the
>> cluster resource management and job scheduling facility) creates an
>> Application Master daemon to look after the lifecycle of the job. (In
>> Hadoop 1, the JobTracker monitored individual jobs as well as handling
>> job scheduling and cluster resource management.) One of the first things
>> the Application Master does is determine which file blocks are needed for
>> processing. The Application Master requests details from the NameNode on
>> where the replicas of the needed data blocks are stored. Using the
>> location data for the file blocks, the Application Master makes requests
>> to the Resource Manager to have map tasks process specific blocks on the
>> slave nodes where they're stored. The key to efficient MapReduce
>> processing is that, wherever possible, data is processed locally, on the
>> slave node where it's stored.
>>
>> Before looking at how the data blocks are processed, you need to look
>> more closely at how Hadoop stores data. In Hadoop, files are composed of
>> individual records, which are ultimately processed one-by-one by mapper
>> tasks. For example, the sample data set we use in this book contains
>> information about completed flights within the United States between 1987
>> and 2008. We have one large file for each year, and within every file,
>> each individual line represents a single flight. In other words, one line
>> represents one record. Now, remember that the block size for the Hadoop
>> cluster is 64MB, which means that the flight data files are broken into
>> chunks of exactly 64MB.
>>
>> Do you see the problem? If each map task processes all records in a
>> specific data block, what happens to those records that span block
>> boundaries? File blocks are exactly 64MB (or whatever you set the block
>> size to be), and because HDFS has no conception of what's inside the file
>> blocks, it can't gauge when a record might spill over into another block.
>> To solve this problem, Hadoop uses a logical representation of the data
>> stored in file blocks, known as input splits. When a MapReduce job client
>> calculates the input splits, it figures out where the first whole record
>> in a block begins and where the last record in the block ends. In cases
>> where the last record in a block is incomplete, the input split includes
>> location information for the next block and the byte offset of the data
>> needed to complete the record.
>>
>> You can configure the Application Master daemon (or JobTracker, if you're
>> in Hadoop 1) to calculate the input splits instead of the job client,
>> which would be faster for jobs processing a large number of data blocks.
>>
>> MapReduce data processing is driven by this concept of input splits. The
>> number of input splits that are calculated for a specific application
>> determines the number of mapper tasks. Each of these mapper tasks is
>> assigned, where possible, to a slave node where the input split is
>> stored. The Resource Manager (or JobTracker, if you're in Hadoop 1) does
>> its best to ensure that input splits are processed locally." *sic*
>>
>> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss,
>> and Roman B. Melnyk
>>
>> Mark Charts
>>
>>
>> On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> Check this post:
>> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>>
>> Regards, D
>>
>>
>> 2014-12-17 15:16 GMT+01:00 Todd <[email protected]>:
>>
>> Hi Hadoopers,
>>
>> I got a question about how many blocks one input split has. Is it random,
>> can the number be configured, or is it fixed (can't be changed)?
>> Thanks!
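Following up on the two "configurable" points in the quoted messages (split size is set per job, while the number of tasks running in parallel is a cluster-side matter), here is a minimal driver-side sketch. The FileInputFormat calls are the standard Hadoop 2 API; the job name and the 256 MB / 64 MB figures are made-up examples, and the YARN properties in the comments are the usual knobs rather than anything specific to this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-config-sketch");  // hypothetical job name

        // Ask for splits of at least 256 MB, so each split spans roughly two
        // 128 MB blocks and the job launches about half as many map tasks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // The opposite direction: cap the split size to get more, smaller
        // splits and therefore more map tasks.
        // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // How many of those map tasks actually run at the same time is a
        // cluster/scheduler decision (YARN container sizing), e.g.:
        //   mapreduce.map.memory.mb              - memory requested per map task
        //   yarn.nodemanager.resource.memory-mb  - memory available per node
    }
}

So, to the original question: a split usually covers exactly one block, but by raising the minimum split size you can make one split span several blocks (at the cost of data locality), and the split count, not the block count, is what determines how many map tasks are created.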
