Hi, I have a question about how mappers are generated for input files 
from HDFS. I understand the split and block concepts in HDFS, but my 
original understanding is that one mapper will only process data from one 
file in HDFS, no matter how small that file is. Is that correct?
The reason I ask is that in some ETL jobs I have seen logic that identifies the 
data set based on a file name convention. So in the mapper, before processing 
the first KV pair, we can add logic in the map() method to get the file name 
of the current input and initialize some per-file state from it (see the sketch 
below). After that, we don't need to worry that later data could come from 
another file, since one map task will only handle data from one file, even when 
the file is very small. So small files not only cause trouble for NameNode 
memory; they also waste map tasks, since each map task may consume too little data.
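For illustration, here is a minimal sketch of what I mean (new mapreduce API; the 
class and field names are just placeholders I made up, and it assumes the split is 
a plain FileSplit from the default FileInputFormat):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private String inputFileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Grab the file name once, before the first KV pair is processed.
        // This cast only works when the split really is a plain FileSplit;
        // a combining input format would hand the mapper a different split type.
        FileSplit split = (FileSplit) context.getInputSplit();
        inputFileName = split.getPath().getName();
        // ... initialize per-file logic here based on the file name convention ...
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Every record in this task is assumed to come from the same file.
        context.write(new Text(inputFileName), new LongWritable(1));
    }
}

With the old mapred API, I believe the same file name is exposed through the job 
configuration property map.input.file.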
But today, when I ran the following Hive query (Hadoop 1.0.4 and Hive 0.9.1):
select partition_column, count(*) from test_table group by partition_column
it only generated 2 mappers in the MR job. This is an external Hive table, and the 
input bytes for this MR job are only 338M, but there are more than 100 data files 
in HDFS for this table, and a lot of them are very small. This is a one-node 
cluster, but it is configured as a full one-node cluster, not local mode. 
Shouldn't the MR job generated here trigger at least 100 mappers? Or is it that 
in Hive my original assumption no longer holds?
Thanks
Yong