Hi guys,
I was measuring the time it takes for a delayed container (held for
container reuse) to be released when the Tez application master shuts
down at the end of its life.
I ran the same Hive-on-Tez query 100 times, and as you can see in the
attached plot there is something strange:
- most of the containers (around 80%) are released in almost exactly one
second
- a few containers are released in a time that ranges from a few
milliseconds to approximately the AM-RM heartbeat interval
(suggesting that the AM is the one telling the RM about the end of the
container).
The NM-RM heartbeat interval is 1s, and I define the release interval as
the time between the "Sending a stop request to the NM for ContainerId"
log entry (AM side) and the queue update (RM side).
I could manually check only a few logs, but it seems the second case
happens when the container is actually able to stop before the AM exits,
while if the AM dies we fall into the first case.
I suspect that if the AM is dead, the RM waits for the NM heartbeat
before it considers the resources available. However, in that case I
would expect a uniform distribution between delta and 1s+delta
(with delta equal to a few ms).
What is really happening here in your opinion? How can the variance of the
first case be so small?
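
To make the expectation above concrete, here is a minimal simulation
sketch. It is not based on YARN or Tez code: the function name, the 5 ms
delta and the assumption that the stop request lands at a uniformly
random point within the NM heartbeat period are all illustrative. Under
those assumptions the release delay should spread evenly over the whole
second rather than cluster around one value.

```python
import random
import statistics


def simulate_release_delays(n, heartbeat_s=1.0, delta_s=0.005, seed=0):
    """Hypothetical model: the RM frees the container at the first NM
    heartbeat after the stop request, plus a fixed processing delay."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n):
        # Assumed: the stop request arrives at a uniformly random phase
        # within the NM heartbeat period.
        phase = rng.uniform(0.0, heartbeat_s)
        # Wait for the next heartbeat, then add the fixed delta.
        delays.append((heartbeat_s - phase) + delta_s)
    return delays


delays = simulate_release_delays(100_000)
print("min   (s):", round(min(delays), 3))
print("max   (s):", round(max(delays), 3))
print("mean  (s):", round(statistics.mean(delays), 3))
print("stdev (s):", round(statistics.stdev(delays), 3))
```

For a uniform distribution of width 1s the standard deviation is
1/sqrt(12), about 0.29s, which is far wider than the tight cluster around
one second seen in the first case.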

Thanks

Fabio
