Tez grouping (if enabled) is explained here. https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
For the rest of the questions, the Hive user mailing list would be a better avenue for answers.

Bikas

From: Nitin Kumar [mailto:[email protected]]
Sent: Wednesday, April 20, 2016 10:54 PM
To: [email protected]
Subject: Managing input split sizes in Hive running the tez engine

Hi,

I want to gain a better understanding of how input splits are calculated in the Tez engine.

I am aware that the hive.input.format property can be set to either HiveInputFormat (default) or CombineHiveInputFormat (generally accepted for a large number of files with sizes << HDFS block size). I was hoping someone could walk me through the differences in how HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file sizes vary from small (smaller than a block) to large (spanning multiple blocks).

I want to dictate the number of mapper tasks that are spawned for scanning a table. For the MR engine this can be controlled by setting the mapred.min.split.size and mapred.max.split.size properties. I need to know if there are similar configurations for the Tez engine.

Also, the properties tez.grouping.max-size, tez.grouping.min-size and tez.grouping.split-waves have been set to 1GB, 16MB and 1.7 respectively. However, I observed that the created input splits do not adhere to these properties. I had two files of size 3MB each for a table. According to the set properties, only 1 mapper task should have spawned, but 2 mapper tasks spawned instead.

Are there other properties in Hive/Tez that need to be set to enable input split grouping?

I would highly appreciate your inputs.

Thanks and regards,
Nitin
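
For readers following the thread, the expectation in the question (two 3MB files, one mapper) can be checked with a rough sketch of the sizing rule that the linked wiki page describes. This is a simplified illustration, not the actual Tez implementation: the function name and the available_slots parameter are hypothetical, and real grouping also takes data locality and Hive-side split generation into account, which may change the final count.

```python
MB = 1024 * 1024
GB = 1024 * MB


def desired_group_count(total_bytes, original_splits, available_slots,
                        waves, min_size, max_size):
    """Estimate how many grouped splits a size-based rule would aim for.

    Simplified sketch: start from slots * waves, then clamp the
    per-group size into the [min_size, max_size] range.
    """
    if original_splits == 0:
        return 0

    # Initial target: enough groups to fill the cluster for `waves` waves.
    desired = max(1, int(available_slots * waves))
    per_group = total_bytes // desired

    if per_group > max_size:
        # Groups would be too large, so ask for more of them.
        desired = total_bytes // max_size + 1
    elif per_group < min_size:
        # Groups would be too small, so ask for fewer of them.
        desired = total_bytes // min_size + 1

    # Grouping can only merge splits, never create more than exist.
    return min(desired, original_splits)


# The case from the question: two 3MB files, min 16MB, max 1GB, waves 1.7.
# The slot count of 10 is an assumed value for illustration only.
print(desired_group_count(6 * MB, 2, 10, 1.7, 16 * MB, 1 * GB))  # prints 1
```

Under these assumptions the size rule alone asks for a single group, which matches the expectation stated in the question. The two mappers observed would therefore have to come from something outside this rule, such as grouping not being applied at all.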
