Hi,

On Sun, Nov 18, 2012 at 11:25 AM, Majid Azimi <[email protected]> wrote:
> hi guys,
>
> I want to get confirmation that I have understood this topic correctly.
> The HDFS block size is the size, in bytes, of the chunks that HDFS splits
> a large file into. The input split size is the number of bytes each mapper
> will actually process. It may be less or more than the HDFS block size.
> Am I right?
Yes.

> Suppose we want to load a 110MB text file to HDFS, and both the HDFS block
> size and the input split size are set to 64MB.
>
> 1. The number of mappers is based on the number of input splits, not the
> number of HDFS blocks? Right?

Correct. Although the default logic derives the split size from the block
size, there is no hard requirement that the two match.

> 2. When we set the HDFS block size to 64MB, is this exactly 67108864
> (64*1024*1024) bytes? I mean, it doesn't matter if the file gets split in
> the middle of a line?

Yes, HDFS makes arbitrarily sized blocks. HDFS does not concern itself with
a file's contents (just as a regular filesystem doesn't). The reader is
expected to handle reading records properly (i.e. read on until the last
newline, etc.). See http://wiki.apache.org/hadoop/HadoopMapReduce for how MR
reads the records without a break in between, even if a block boundary has
broken a record.

> 3. Now we have 2 input splits (so two maps). The last line of the first
> block and the first line of the second block are not meaningful on their
> own. TextInputFormat is responsible for reading meaningful lines and giving
> them to the map tasks. What TextInputFormat does is:
>
> In the second split it will seek to the second line, which is a complete
> line, read from there and give those lines to the second mapper.
> The first mapper will read until the end of the first block, and will also
> process the (last incomplete line of the first block + first incomplete
> line of the second block).

Yes, this is explained at http://wiki.apache.org/hadoop/HadoopMapReduce as
well.

> So the input that the first mapper processes is not exactly 64MB; it is a
> bit more than that (by the first incomplete line of the second block).
> Likewise, the second mapper processes a bit less than 64MB. Am I right?
> So the HDFS block size is an exact number, but the input split boundary is
> adjusted to our data's record boundaries, which may differ a little from
> the configured number? Right?

Yes, all correct.

--
Harsh J
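P.S. The mechanics above can be sketched in plain Java. This is not Hadoop
code; the class and method names are made up for illustration, but the logic
mirrors what FileInputFormat (fixed-size splits) and TextInputFormat's line
reader do: a non-first split discards the partial line it starts in, and
every split finishes the last line it started, even past its boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // Chop a file of fileLen bytes into fixed-size splits of [start, length].
    // (The real FileInputFormat also applies a ~10% "slop" so a tiny tail
    // doesn't become its own split; omitted here for simplicity.)
    static List<int[]> computeSplits(int fileLen, int splitSize) {
        List<int[]> splits = new ArrayList<>();
        for (int start = 0; start < fileLen; start += splitSize) {
            splits.add(new int[]{start, Math.min(splitSize, fileLen - start)});
        }
        return splits;
    }

    // Return the lines belonging to one split, TextInputFormat-style.
    static List<String> readSplit(byte[] data, int start, int length) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // A non-first split skips the (possibly partial) line it starts
            // in; the previous split's reader owns that line. Backing up one
            // byte lets a split that starts exactly on a line boundary keep
            // its first line.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // first record starts just past that newline
        }
        int end = start + length;
        // Read whole lines; the last line started before 'end' is read to
        // completion even if it runs past the split boundary.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            records.add(new String(data, pos, lineEnd - pos,
                    StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "line1\nline2\nline3\nline4\n"
                .getBytes(StandardCharsets.UTF_8);
        // A 10-byte split size forces a boundary into the middle of line2.
        for (int[] s : computeSplits(data.length, 10)) {
            System.out.println("split at " + s[0] + " -> "
                    + readSplit(data, s[0], s[1]));
        }
        // prints:
        // split at 0 -> [line1, line2]
        // split at 10 -> [line3, line4]
        // split at 20 -> []
    }
}
```

Note how split 1 reads past byte 10 to finish line2, split 2 discards the
tail of line2 and starts at line3, and the last split yields nothing because
its only line was already consumed: every line is read exactly once.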
