Hi guys, I want to confirm that I have understood this topic correctly. The HDFS block size is the number of bytes per block that HDFS uses when it splits a large file into smaller blocks. The input split size is the number of bytes each mapper will actually process; it may be smaller or larger than the HDFS block size. Am I right?
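To make the "smaller or larger than the block size" part concrete, here is a small Python sketch of how I understand the split size to be chosen. As far as I know, FileInputFormat uses the rule max(minSize, min(maxSize, blockSize)), where the minimum and maximum come from the job configuration. The function name and default values below are my own, for illustration only; this is not Hadoop code.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule as I understand it: max(min_size, min(max_size, block_size)).

    min_size and max_size stand in for the configured minimum and maximum
    split sizes; the defaults here are illustrative assumptions.
    """
    return max(min_size, min(max_size, block_size))


block = 64 * MB

# No min/max configured: the split size equals the block size.
print(compute_split_size(block))

# Maximum split size below the block size: splits are smaller than a block.
print(compute_split_size(block, max_size=32 * MB))

# Minimum split size above the block size: splits are larger than a block.
print(compute_split_size(block, min_size=128 * MB))
```

If this rule is right, the split size only differs from the block size when a minimum or maximum split size is set explicitly.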
Suppose we want to load a 110 MB text file into HDFS, and both the HDFS block size and the input split size are set to 64 MB.

1. The number of mappers is based on the number of input splits, not the number of HDFS blocks, right?
2. When we set the HDFS block size to 64 MB, is this exactly 67108864 (64*1024*1024) bytes? I mean, does it not matter that the file may be split in the middle of a line?
3. Now we have 2 input splits (so two map tasks). The last line of the first block and the first line of the second block are not meaningful on their own, because a single line is cut at the block boundary. TextInputFormat is responsible for reading complete lines and giving them to the map tasks. What TextInputFormat does is:
- In the second block, it seeks to the second line, which is a complete line, reads from there, and gives the data to the second mapper.
- The first mapper reads until the end of the first block and also processes the line that straddles the boundary (the last incomplete line of the first block + the first incomplete line of the second block).

So the amount of data the first mapper processes is not exactly 64 MB; it is a bit more than that (by the first incomplete line of the second block). Likewise, the second mapper processes a bit less than the size of the second block (about 46 MB here, since the file is 110 MB), because it skips that partial first line. Am I right?

So the HDFS block size is an exact number, but the amount of data each input split actually covers depends on our data (the line boundaries) and may differ a little from the configured number, right?
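Here is a toy Python simulation of the line handling as I picture it, scaled down to a 110-byte "file" with 64-byte splits. It is only a sketch of my understanding, not Hadoop's code: the function names are mine, the 1.1 slop factor is what I believe FileInputFormat uses when deciding whether to cut another split, and the skip/read-past-the-end rules are my reading of what the line record reader does.

```python
def make_splits(file_len, split_size, slop=1.1):
    """Cut a file into (start, length) byte ranges.

    A new full-size split is cut only while the remaining bytes exceed
    slop * split_size (assumed behaviour); the rest becomes the last split.
    """
    splits, offset, remaining = [], 0, file_len
    while remaining / split_size > slop:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


def read_split(data, start, length):
    """Return the complete lines a line-oriented reader emits for one split."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: skip up to and including the first newline,
        # because the previous split's reader is responsible for that line.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # Read every line that starts at or before the split end, even if the
    # line itself finishes beyond the end (inside the next block).
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:nxt])
        pos = nxt
    return lines


# 11 lines of exactly 10 bytes each: a 110-byte stand-in for the 110 MB file.
data = b"".join(b"row-%05d\n" % i for i in range(11))

for start, length in make_splits(len(data), 64):
    lines = read_split(data, start, length)
    consumed = sum(len(line) for line in lines)
    print(f"split at {start}, length {length}: {len(lines)} lines, {consumed} bytes")
```

In this toy run the split boundaries stay fixed at 64 and 46 bytes, but the first reader consumes 70 bytes (7 lines) and the second only 40 bytes (4 lines), and every line is read exactly once.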
