Hi, I have a question about how mappers are generated for the input files from
HDFS. I understand the split and block concepts in HDFS, but my original
understanding is that one mapper will only process data from one file in HDFS,
no matter how small that file is. Is that correct?
The reason I ask is that in some ETL jobs I have seen logic that identifies the
data set based on the file name convention. So in the mapper, before processing
the first KV pair, we can add logic to get the file name of the current input
and initialize some state. After that, we don't need to worry that data could
come from another file later, as one map task will only handle data from one
file, even when the file is very small. So small files not only cause trouble
for NameNode memory, they also waste map tasks, as each map task may consume
too little data.
But today, when I ran the following Hive query (Hadoop 1.0.4 and Hive 0.9.1):
select partition_column, count(*) from test_table group by partition_column
it only generated 2 mappers in the MR job. This is an external Hive table, and
the input for this MR job is only 338 MB, but there are more than 100 data
files in HDFS for this table, although many of them are very small. This is a
one-node cluster, but it is configured in full cluster mode, not local mode.
Shouldn't the MR job generated here trigger at least 100 mappers? Or is it that
in Hive my original assumption no longer holds?
Thanks
Yong