Thanks for the detailed explanation, Yong. It helps.


On Tuesday, February 25, 2014 9:18 PM, java8964 <> wrote:
Yes, it is good for the file sizes to be roughly even, but it is not very 
important unless some files are very small (compared to the block size).

The reasons are:

Your files should be splittable to be used in Hadoop (or in Hive, it is the 
same thing). If they are splittable, then a 1G file will use 8 blocks (assuming 
the block size is 128M), and a 256M file will take 2 blocks. So these 2 files 
will generate 10 mapper tasks, which will be spread evenly across your cluster. 
From a performance point of view, the work is balanced across those 10 mapper 
tasks, so one 1G file plus one 256M file is no big deal. 
But if one file is very small, like 10M, that file will also consume one mapper 
task, and that is bad for performance: starting and stopping a task uses quite 
a lot of resources, yet the task processes only 10M of data.
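
As a rough illustration of the arithmetic above, here is a minimal Python 
sketch. It assumes one mapper per HDFS block of each splittable file and a 
128M block size; the function name is made up for this example, and real 
input formats may merge or size splits a little differently.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size


def estimated_mappers(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Estimate map tasks as one task per block of each splittable file."""
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)


# A 1G (1024M) file needs 8 blocks and a 256M file needs 2 blocks.
print(estimated_mappers([1024, 256]))  # 10

# A 10M file still costs a whole mapper task.
print(estimated_mappers([10]))  # 1
```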

The reason you see uneven file sizes in the sqoop output is that it is hard 
for sqoop to split your source data evenly. For example, if you dump table A 
from a DB to Hive, sqoop will do the following:

1) Identify the primary/unique keys of the table.
2) Find the min/max values of the keys; let's say they are 1 and 1,000,000.
3) Split that range based on the number of mapper tasks. If you run sqoop with 
4 mappers, the data will be split into 4 groups: (1, 250,000) (250,001, 
500,000) (500,001, 750,000) (750,001, 1,000,000). As you can imagine, your data 
is most likely not evenly distributed by primary key across those 4 groups, so 
you will get uneven output in the part-m-xxx files.
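
The range split in the steps above can be sketched in Python as follows. This 
is only an illustration of the idea, not sqoop's actual code; the function 
names and the sample keys are made up.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous ranges of near-equal width."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        width = base + (1 if i < extra else 0)
        if width == 0:
            continue  # more mappers than keys: skip empty ranges
        end = start + width - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def rows_per_range(keys, ranges):
    """Count how many keys fall into each inclusive range."""
    return [sum(1 for k in keys if lo <= k <= hi) for lo, hi in ranges]


ranges = split_ranges(1, 1_000_000, 4)
print(ranges)
# [(1, 250000), (250001, 500000), (500001, 750000), (750001, 1000000)]

# Hypothetical table whose keys cluster at both ends of the range.
keys = list(range(1, 9001)) + list(range(990_001, 991_001))
print(rows_per_range(keys, ranges))  # [9000, 0, 0, 1000]
```

The ranges have equal width, but the row counts (and so the part-m-xxx file 
sizes) follow wherever the keys actually are.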

Keep in mind that it is not required to use primary keys or unique keys as the 
split column. So you can choose whatever column in your table makes sense. Pick 
whichever one makes the split more even.
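
One way to compare candidate split columns before running the import is to 
bucket a sample of their values and look at the row counts. The Python sketch 
below does that for integer columns; the table, the column names "id" and 
"seq", and the helper functions are all hypothetical. In sqoop itself the 
split column is selected with the --split-by argument.

```python
def bucket_counts(values, num_mappers):
    """Count rows per equal-width range of [min(values), max(values)] (integers only)."""
    lo, hi = min(values), max(values)
    span = hi - lo + 1
    counts = [0] * num_mappers
    for v in values:
        idx = min((v - lo) * num_mappers // span, num_mappers - 1)
        counts[idx] += 1
    return counts


def skew(counts):
    """Largest bucket divided by the average bucket; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical sample: "id" is sparse and uneven, "seq" is dense and even.
rows = [{"id": i * i, "seq": i} for i in range(1, 1001)]

for column in ("id", "seq"):
    counts = bucket_counts([r[column] for r in rows], 4)
    print(column, counts, skew(counts))
# id [500, 207, 159, 134] 2.0
# seq [250, 250, 250, 250] 1.0
```

Here "seq" would be the better split column, since each of the 4 mappers would 
get the same number of rows.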


Date: Tue, 25 Feb 2014 17:42:20 -0800
Subject: part-m-00000 files and their size - Hive table


I am loading data to HDFS files through sqoop and creating a Hive table to 
point to these files.

The mapper output files generated by sqoop look like the example below.




My question is -
1) For Hive query performance, how important or significant is the 
distribution of the file sizes above?

part_m_0 say 1 GB
part_m_1 say 3 GB
part_m_2 say 0.25 GB


part_m_0 say 1.4 GB
part_m_1 say 1.4 GB
part_m_2 say 1.45 GB

NOTE: The sizes and number of files are just a sample. The real numbers are far 

I am assuming the uniform distribution has a performance benefit.

If so, what is the reason, and what are the technical details?
