Hi,

There's no fixed relationship between the number of containers and the number of tasks; rather, the number of tasks is the maximum number of containers you can use.
With container reuse, an entire vertex containing many task attempts can run in a single container if no more containers are available. The memory/CPU settings are actually set via a configuration parameter, hive.tez.container.size.

Each vertex is expanded into multiple tasks. The number of map tasks is determined by split grouping (tez.grouping.min-size / tez.grouping.split-waves), and the number of reducers is estimated from the ReduceSink statistics (divided by hive.exec.bytes.per.reducer). Even the reducer count is not final, since the plan-time value is only a maximum: you can schedule 1009 reducers and end up running only 11, thanks to Tez auto-reducer parallelism, which merges adjacent reducers at runtime.

This logic is split between the Tez SplitGrouper, HiveSplitGenerator, and SetReducerParallelism.

Cheers,
Gopal

From: Yunqi Zhang <yu...@umich.edu>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Tuesday, June 9, 2015 at 5:07 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Hive on Tez

Hi guys,

I'm playing with the code that integrates Hive with Tez, and have a couple of questions regarding resource allocation. To my understanding (correct me if I am wrong), Hive creates a DAG composed of MapVertex and ReduceVertex nodes, where each vertex is later translated by Tez into tasks running on potentially multiple containers. I was wondering how the resource requirements are determined in the current implementation (how many containers are needed for each vertex, what the CPU and memory requirements are, etc.), and where I can find the code corresponding to this.

Thank you!
Yunqi
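[Editor's note] The plan-time reducer estimate Gopal describes (total ReduceSink bytes divided by hive.exec.bytes.per.reducer, capped at the configured maximum) can be sketched roughly as below. This is an illustrative sketch only, not the actual Hive code in SetReducerParallelism; the function name and defaults here are hypothetical stand-ins (256 MB per reducer, a 1009-reducer cap matching the example in the reply):

```python
import math

def estimate_reducers(total_bytes,
                      bytes_per_reducer=256 * 1024 * 1024,  # stand-in for hive.exec.bytes.per.reducer
                      max_reducers=1009):                    # stand-in for the reducer cap
    """Plan-time estimate: one reducer per bytes_per_reducer of
    ReduceSink output, at least 1, at most max_reducers. The real
    runtime count can still shrink via Tez auto-reducer parallelism."""
    if total_bytes <= 0:
        return 1
    return min(max_reducers, max(1, math.ceil(total_bytes / bytes_per_reducer)))

# e.g. 3 GB of ReduceSink output with the 256 MB-per-reducer assumption:
print(estimate_reducers(3 * 1024**3))   # -> 12
# a huge input hits the cap, which is why a plan can show 1009 reducers:
print(estimate_reducers(10**15))        # -> 1009
```

Note that this is only the upper bound: at runtime the ShuffleVertexManager can merge adjacent reducers downward, which is how 1009 planned reducers can end up as 11 actual ones.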