Hi,

There's no fixed relationship between the number of containers and the number
of tasks -- well, the number of tasks is the maximum number of containers you
can use.

You can run an entire vertex containing many task attempts in one container
if no more containers are available -- because of container reuse.
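
Container reuse itself is controlled by the Tez AM; a quick, untested sketch
of the relevant knobs (the values here are illustrative, not recommendations):

  set tez.am.container.reuse.enabled=true;  -- reuse the same JVM for multiple task attempts
  set tez.am.container.idle.release-timeout-min.millis=10000;  -- hold idle containers briefly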

The memory/CPU settings are actually set up via a configuration parameter --
hive.tez.container.size.
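
For example, from a Hive session (a minimal sketch; the sizes below are
assumptions for illustration only):

  set hive.tez.container.size=4096;   -- container memory, in MB
  set hive.tez.java.opts=-Xmx3276m;   -- JVM heap, usually ~80% of the container size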

The Vertex is expanded into multiple tasks. The number of map tasks is
determined by the split grouping
(tez.grouping.min-size/tez.grouping.split-waves), and the number of reducers
is estimated from the ReduceSink statistics (divided by
hive.exec.bytes.per.reducer).
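
As a rough worked example (the 10 GB figure is assumed, and the default for
hive.exec.bytes.per.reducer differs across Hive versions):

  set hive.exec.bytes.per.reducer=268435456;  -- 256 MB per reducer
  -- if the ReduceSink edge carries ~10 GB, the plan estimates
  -- 10240 MB / 256 MB = 40 reducers

and on the map side, the grouping knobs look like:

  set tez.grouping.min-size=16777216;  -- 16 MB lower bound per grouped split
  set tez.grouping.split-waves=1.7;    -- task waves relative to available capacity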

Even the reducer number is not final, since the plan-time value is only the
maximum -- you can schedule 1009 reducers and end up running only 11, thanks
to Tez auto-reducer parallelism, which only merges adjacent reducers.
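
That runtime merging depends on auto-parallelism being enabled; a sketch of
the usual settings (the factors shown are, as far as I know, the defaults --
check your version):

  set hive.tez.auto.reducer.parallelism=true;  -- let Tez shrink the reducer count at runtime
  set hive.tez.max.partition.factor=2.0;       -- over-partition at plan time
  set hive.tez.min.partition.factor=0.25;      -- lower bound when merging partitions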

This logic is split between the Tez SplitGrouper, HiveSplitGenerator, and
SetReducerParallelism.

Cheers,
Gopal

From:  Yunqi Zhang <yu...@umich.edu>
Reply-To:  "user@hive.apache.org" <user@hive.apache.org>
Date:  Tuesday, June 9, 2015 at 5:07 PM
To:  "user@hive.apache.org" <user@hive.apache.org>
Subject:  Hive on Tez

Hi guys,
 
I'm playing with the code that integrates Hive with Tez, and have a couple of
questions regarding resource allocation.
 
To my understanding (correct me if I am wrong), Hive creates a DAG composed
of MapVertex and ReduceVertex nodes, where each Vertex is later translated by
Tez into tasks running on potentially multiple containers. I was wondering how
the resource requirements are determined in the current implementation (how
many containers are needed for each Vertex, what the CPU and memory
requirements are, etc.), and where I can find the corresponding code.
 
Thank you!


Yunqi

