> I have a Hive full-scan job. With Hive on MR I can fully use the
> whole cluster's 1000 CPU vcores (I use the split size to get 1200
> mapper tasks).
> But Tez only uses around 700 vcores, even though I have set the same
> Hive split size. How do I configure Tez so that it fully uses all the
> cluster resources?
If you're on hive-1.0 or later, the option to go wide is called tez.grouping.split-waves.

With ORC, the regular MRv2 splits generate empty tasks (so not all map tasks process valid ranges).

But to get it as wide as possible,

set mapred.max.split.size=33554432;
set tez.grouping.split-waves=1.7;
set tez.grouping.min-size=16777216;

should do the trick. The split-waves setting multiplies the current queue capacity by 1.7, so the job goes wider than the capacity that is actually available.

In previous versions (0.13/0.14), "set" commands don't work for these, so the options are prefixed with tez.am.* and you have to pass them on the command line:

hive -hiveconf tez.am.grouping.split-waves=1.7 -hiveconf tez.grouping.min-size=16777216 -hiveconf mapred.max.split.size=33554432

We hope to throw away these hacks in hive-1.2, and for this Prasanth checked in a couple of different split strategies for ORC in hive-1.2.0 (ETL/BI/HYBRID).

I will probably send out my slides about ORC (incl. the new split generation) after Hadoop Summit Europe, if you want more details.

Ideally, any tests with the latest code would help me fix anything that's specific to your use cases.

Cheers,
Gopal
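To make the split-waves arithmetic concrete, here is a rough Python sketch of how the number of grouped tasks might be estimated. It is not Tez's actual code. The function name, the 1 GiB max-size default, and the assumption of one task per vcore (so 1000 slots for the cluster in the question) are illustrative assumptions; only the 1.7 waves factor and the 16 MiB min-size come from the settings above.

```python
import math


def estimate_grouped_tasks(total_bytes, available_slots, waves=1.7,
                           min_size=16 * 1024 * 1024,
                           max_size=1024 * 1024 * 1024):
    """Hypothetical estimate of how many grouped tasks a scan ends up with.

    available_slots: tasks the queue can run at once (assumed: one per vcore).
    waves:           multiplier in the role of tez.grouping.split-waves.
    min_size:        lower bound per group, as in tez.grouping.min-size.
    max_size:        upper bound per group (assumed default, not from the mail).
    """
    # Aim for "capacity * waves" tasks, i.e. wider than the cluster itself.
    desired = max(1, round(available_slots * waves))
    bytes_per_group = total_bytes / desired

    if bytes_per_group > max_size:
        # Groups would be too large: use more, smaller groups.
        desired = math.ceil(total_bytes / max_size)
    elif bytes_per_group < min_size:
        # Groups would be too small: use fewer, larger groups.
        desired = max(1, math.ceil(total_bytes / min_size))
    return desired


if __name__ == "__main__":
    one_tib = 2 ** 40
    ten_gib = 10 * 2 ** 30
    print(estimate_grouped_tasks(one_tib, 1000))  # wide enough to fill 1000 slots
    print(estimate_grouped_tasks(ten_gib, 1000))  # limited by min_size instead
```

Under these assumptions, a 1 TiB scan on 1000 slots targets 1700 tasks, while a 10 GiB scan is capped at 640 tasks by the 16 MiB minimum group size. That is one way a job can stay narrower than the cluster even with split-waves set, and it is why the minimum size is lowered alongside it.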
