RE: Managing input split sizes in Hive running the tez engine

Bikas Saha Thu, 21 Apr 2016 10:57:36 -0700

Tez grouping (if enabled) is explained here. 
https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works


 

For the rest of the questions, the hive user mailing list would be a better 
avenue for answers.

 

Bikas

 

From: Nitin Kumar [mailto:[email protected]] 
Sent: Wednesday, April 20, 2016 10:54 PM
To: [email protected]
Subject: Managing input split sizes in Hive running the tez engine

 

Hi,

I want to gain a better understanding of how the input splits are calculated 
in the tez engine.


I am aware that the hive.input.format property can be set to either 
HiveInputFormat (default) or to CombineHiveInputFormat (generally recommended 
for a large number of files with sizes << hdfs block size). 

I was hoping someone could walk me through the differences in how 
HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file 
sizes vary from small (smaller than a block) to large (spanning multiple blocks).

I want to dictate the number of mapper tasks that are spawned for scanning a 
table. For the MR engine this can be controlled by setting the 
mapred.min.split.size and mapred.max.split.size properties. I need to know if 
there are similar configurations for the tez engine.
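
For reference, the MR-side control mentioned above can be sketched roughly as 
follows. Hadoop's FileInputFormat derives the split size as 
max(minSize, min(maxSize, blockSize)). The function name and the example 
values below are illustrative assumptions, and the sketch ignores how the 
final partial split of each file is handled.

```python
MB = 1024 * 1024


def mr_split_size(block_size, min_split_size, max_split_size):
    """Split size as max(min, min(max, block)), the FileInputFormat rule."""
    return max(min_split_size, min(max_split_size, block_size))


# Assumed example: 128MB blocks, max split size lowered to 64MB
# -> smaller splits, hence more mappers.
print(mr_split_size(128 * MB, 1, 64 * MB) // MB)

# Assumed example: min split size raised to 256MB
# -> larger splits, hence fewer mappers.
print(mr_split_size(128 * MB, 256 * MB, 1024 * MB) // MB)
```

Lowering the maximum below the block size increases the mapper count, while 
raising the minimum above the block size decreases it.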

 

Also the properties tez.grouping.max-size, tez.grouping.min-size and 
tez.grouping.split-waves have been set to the values of 1GB, 16MB and 1.7 
respectively. However I observed that the created input splits do not adhere to 
these properties. 

I had two files of size 3MB each for a table. According to the set properties, 
only 1 mapper task should have spawned but 2 mapper tasks spawned instead.
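
The expectation of a single mapper can be checked with a rough sketch of the 
size clamping described on the wiki page linked above. This is a 
simplification under stated assumptions, not Tez's actual implementation: the 
function name, the available_slots parameter and its value are hypothetical, 
and real grouping also takes locality into account and only applies when 
grouping is enabled.

```python
MB = 1024 * 1024


def desired_grouped_splits(total_bytes, num_original_splits, available_slots,
                           waves=1.7, min_size=16 * MB, max_size=1024 * MB):
    """Estimate how many grouped splits a size-clamped grouping would aim for."""
    # Start from available capacity multiplied by the wave factor.
    desired = max(1, int(available_slots * waves))
    per_group = total_bytes / desired
    if per_group > max_size:
        # Groups would be too large: use more groups of at most max_size.
        desired = total_bytes // max_size + 1
    elif per_group < min_size:
        # Groups would be too small: use fewer groups of at least min_size.
        desired = total_bytes // min_size + 1
    # Grouping cannot yield more groups than there are original splits.
    return min(desired, num_original_splits)


# Two 3MB files, one original split each; available_slots=10 is an assumed value.
print(desired_grouped_splits(2 * 3 * MB, 2, available_slots=10))
```

With a 16MB minimum, the 6MB of total input falls below the lower bound for any 
slot count, so this sketch yields one grouped split, which matches the 
expectation stated above.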

Are there other properties in hive/tez that need to be set to enable input 
split grouping?

I would highly appreciate your inputs.

Thanks and regards,

Nitin
