[ 
https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-4331:
-----------------------------
    Summary: Restarting NodeManager leaves orphaned containers  (was: Killing 
NodeManager leaves orphaned containers)

Note that killing the nodemanager itself with SIGKILL should not, by itself, cause 
the containers to be killed.  Instead the problem seems to be that when the 
nodemanager restarts it is either failing to reacquire the containers that were 
running, or it reacquires them and the RM fails to tell the NM to kill them when 
it re-registers.  Updating the summary accordingly.  Also, by "the AM and its 
container" I assume you mean the application master and some other container 
that the AM launched.  Please correct me if I'm wrong.

Is work-preserving nodemanager restart enabled on this cluster?  Without it a 
nodemanager cannot track containers that were previously running, so it will not 
be able to reacquire and kill them.  If they don't exit on their own they will 
"leak" and continue running outside of YARN's knowledge.  If that feature is not 
enabled on the nodemanager then this behavior is expected, since killing it with 
SIGKILL gave the nodemanager no chance to perform any container cleanup on its 
own.
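For reference, a minimal sketch of the yarn-site.xml settings that enable 
work-preserving NM restart (available since Hadoop 2.6, so applicable to 2.7.1).  
The recovery directory path below is a placeholder, and note the NM must also 
listen on a fixed, non-ephemeral port for recovery to work:

```xml
<!-- Enable work-preserving NodeManager restart. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Local directory where the NM persists container state across restarts.
     The path here is just an example. -->
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<!-- Recovery requires a fixed RPC port rather than an ephemeral (port 0)
     address, so running containers can be matched up after restart. -->
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```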

If restart is enabled on the nodemanager then this behavior could be correct if 
the running application told the RM that containers should not be killed when 
AM attempts fail.  In that case the container should be left running and it's up 
to the AM to reacquire it via some means.  (I believe the RM does provide a bit 
of help there in the AM-RM protocol.)
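For context, a sketch of how an application opts into that behavior through the 
standard YARN submission API (whether Samza actually sets this flag is an 
assumption you'd need to verify; the class and method names are the real 
hadoop-yarn-api ones, and the snippet needs those jars on the classpath):

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class KeepContainersSketch {
    public static void main(String[] args) {
        ApplicationSubmissionContext ctx =
                Records.newRecord(ApplicationSubmissionContext.class);
        // Ask the RM to leave running containers alone when an AM attempt
        // fails, instead of killing them.  A restarted AM can then pick them
        // back up from the RM's help in the AM-RM protocol, e.g.
        // RegisterApplicationMasterResponse#getContainersFromPreviousAttempts().
        ctx.setKeepContainersAcrossApplicationAttempts(true);
    }
}
```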

If the containers were supposed to be killed when the AM attempt failed then we 
need to figure out which of the two possibilities above is the problem.  Could 
you look in the NM logs and see if it said it was able to reacquire the 
containers that were running before it was killed?  If it didn't then we need 
to figure out why, and log snippets around the restart/recovery would be a big 
help.  If it did reacquire the containers and registered with the RM reporting 
those containers, then apparently the RM didn't tell the NM to kill the 
undesired containers.  In that case the log from the RM side around the time 
the NM re-registered would be helpful.

> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>
>                 Key: YARN-4331
>                 URL: https://issues.apache.org/jira/browse/YARN-4331
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.7.1
>            Reporter: Joseph
>            Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by 
> killing the nodemanager.
> I'm running YARN 2.7.1 with RM state stored in ZooKeeper and deploying Samza 
> jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to the nodemanager 
> 3. We should see the AM and its container running without the nodemanager
> 4. The AM should die but the container still keeps running
> 5. Restarting the nodemanager brings up a new AM and container but leaves the 
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
