[
https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated YARN-4331:
-----------------------------
Summary: Restarting NodeManager leaves orphaned containers (was: Killing
NodeManager leaves orphaned containers)
Note that killing the nodemanager itself with SIGKILL should not, by itself,
cause the containers to be killed. Instead the problem seems to be that when
the nodemanager restarts it either fails to reacquire the containers that were
running, or it reacquires them but the RM fails to tell the NM to kill them
when it re-registers. Updating the summary accordingly. Also, by "the AM and
its container" I assume you mean the application master and some other
container that the AM launched. Please correct me if I'm wrong.
Is work-preserving nodemanager restart enabled on this cluster? Without it the
nodemanager cannot track containers that were previously running, so it will
not be able to reacquire and kill them. If they don't exit on their own
then they will "leak" and continue running outside of YARN's knowledge. If
that feature is not enabled on the nodemanager then this behavior is expected,
since killing it with SIGKILL gave the nodemanager no chance to perform any
container cleanup on its own.
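For reference, work-preserving restart is governed by
yarn.nodemanager.recovery.enabled (plus yarn.nodemanager.recovery.dir for the
recovery state store). Below is a minimal sketch, not tied to this cluster, of
how one could dump those values from whatever yarn-site.xml is on the
classpath; the class name is just an illustration.
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CheckNmRecoveryConfig {
  public static void main(String[] args) {
    // Loads yarn-default.xml/yarn-site.xml from the classpath, the same way
    // the NodeManager does on startup.
    YarnConfiguration conf = new YarnConfiguration();

    boolean recoveryEnabled = conf.getBoolean(
        YarnConfiguration.NM_RECOVERY_ENABLED,
        YarnConfiguration.DEFAULT_NM_RECOVERY_ENABLED);
    String recoveryDir = conf.get(YarnConfiguration.NM_RECOVERY_DIR);

    System.out.println("yarn.nodemanager.recovery.enabled = " + recoveryEnabled);
    System.out.println("yarn.nodemanager.recovery.dir     = " + recoveryDir);
  }
}
{code}
If recovery is disabled, the leaked containers after a SIGKILL are the
expected behavior described above.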
If restart is enabled on the nodemanager then this behavior could be correct if
the running application told the RM that its containers should not be killed
when an AM attempt fails. In that case the container should be left running and
it's up to the AM to reacquire it via some means. (I believe the RM does
provide a bit of help there in the AM-RM protocol; see the sketch below.)
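To illustrate that last point: when keepContainersAcrossApplicationAttempts is
set on the ApplicationSubmissionContext at submission time, the RM hands the
new attempt the list of containers it kept alive as part of the register
response. A rough sketch using the AMRMClient API follows; the host, port and
tracking URL are placeholders, and a real AM would go on to heartbeat and
unregister properly.
{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ReacquireContainersSketch {
  public static void main(String[] args) throws Exception {
    // At submission time the client would have asked the RM to keep containers
    // across attempts, e.g.:
    //   submissionContext.setKeepContainersAcrossApplicationAttempts(true);

    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // The register response lists containers the RM kept from the previous
    // attempt; the new AM should adopt these instead of asking for new ones.
    RegisterApplicationMasterResponse response =
        rmClient.registerApplicationMaster("am-host", -1, "");
    List<Container> previous = response.getContainersFromPreviousAttempts();
    System.out.println("Containers kept from previous attempts: "
        + previous.size());

    rmClient.stop();
  }
}
{code}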
If the containers were supposed to be killed when the AM attempt failed then we
need to figure out which of the two possibilities above is the problem. Could
you look in the NM logs and see whether, after the restart, it reported
reacquiring the containers that were running before it was killed? If it
didn't then we need to figure out why, and log snippets around the
restart/recovery would be a big help. If it did reacquire the containers and
re-register with the RM listing those containers, then apparently the RM
didn't tell the NM to kill the unwanted containers. In that case the RM-side
log from around the time the NM re-registered would be helpful.
> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.7.1
> Reporter: Joseph
> Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by
> killing nodemanager.
> I'm running YARN 2.7.1 with RM state stored in ZooKeeper and deploying Samza
> jobs.
> Steps:
> {quote}1. Deploy a job
> 2. Issue a kill -9 signal to nodemanager
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)