Jason Lowe commented on YARN-4331:

SAMZA-750 is discussing RM restart, but this is NM restart.  They are related 
but mostly independent features, and one can be enabled without the other.  
Check if yarn.nodemanager.recovery.enabled=true on that node.  If you want to 
support rolling upgrades of the entire YARN cluster they both need to be 
enabled, but if you simply want to restart/upgrade a NodeManager independent of 
the ResourceManager then you can turn on nodemanager restart without 
resourcemanager restart.  NodeManager restart should be mostly invisible to 
applications except for interruptions in the auxiliary services on that node 
(e.g.: shuffle handler).
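For reference, enabling work-preserving NM restart looks roughly like this in yarn-site.xml (the recovery directory path below is just an example; pick a local dir that survives NM restarts):

```xml
<!-- yarn-site.xml on the NodeManager host -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- local dir where the NM persists container state across restarts;
       the path here is an example, not a default -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
```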

bq. if the application master (AM) is dead, shouldn't it be responsibility of 
the container to kill itself?

That is completely application framework dependent and not the responsibility 
of YARN.  A container is completely under the control of the application (i.e.: 
user code) and doesn't have to have any YARN code in it at all.  Theoretically 
one could write an application entirely in C or Go or whatever that generates 
compatible protocol buffers and adheres to the YARN RPC protocol semantics.  No 
YARN code would be running at all for that application or in any of its 
containers at that point.  (I know of no such applications, but it is 
theoretically possible.)

Also it is not a requirement that containers have an umbilical connection to 
the ApplicationMaster.  That choice is up to the application, and some 
applications don't do this (like the distributed shell sample YARN 
application).  MapReduce is an application framework that does have an 
umbilical connection, but if there's a bug in that app where tasks don't 
properly recognize the umbilical was severed then that's a bug in the app and 
not a bug in YARN.  Once the nodemanager died on the node, YARN lost all 
ability to control containers on that node.  If the container decides not to 
exit then that's an issue with the app more than an issue with YARN.  There's 
not much YARN can do about it since YARN's actor on that node is no longer 
running.
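To make the umbilical idea concrete, here is a minimal sketch of what a well-behaved framework task might do on its own: poll an AM-liveness check and stop itself after repeated failures. This is illustrative only, not code from MapReduce or YARN; the class name, the BooleanSupplier check, and the failure threshold are all made up for the example (a real task would use an RPC ping and call System.exit on severance).

```java
import java.util.function.BooleanSupplier;

// Hypothetical container-side watchdog: the application (not YARN) decides
// to exit when its ApplicationMaster becomes unreachable.
public class UmbilicalWatchdog {
    private final BooleanSupplier amReachable; // stand-in for an RPC ping to the AM
    private final int maxFailures;             // consecutive failures before giving up

    public UmbilicalWatchdog(BooleanSupplier amReachable, int maxFailures) {
        this.amReachable = amReachable;
        this.maxFailures = maxFailures;
    }

    /** Polls until maxFailures consecutive failed pings; returns total polls made. */
    public int runUntilSevered() {
        int consecutiveFailures = 0;
        int polls = 0;
        while (consecutiveFailures < maxFailures) {
            polls++;
            if (amReachable.getAsBoolean()) {
                consecutiveFailures = 0;   // AM answered; reset the counter
            } else {
                consecutiveFailures++;     // AM missed a heartbeat
            }
        }
        // A real task would clean up and call System.exit(...) here.
        System.out.println("umbilical severed after " + polls + " polls; exiting");
        return polls;
    }

    public static void main(String[] args) {
        // Simulate an AM that answers the first two pings, then disappears.
        int[] calls = {0};
        UmbilicalWatchdog w = new UmbilicalWatchdog(() -> ++calls[0] <= 2, 3);
        w.runUntilSevered(); // prints "umbilical severed after 5 polls; exiting"
    }
}
```

The point of the sketch is that this logic lives entirely in user code: if the task never checks the umbilical, YARN has no way to kill it once the NM on that node is gone.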

If NM restart is not enabled then the nodemanager should _not_ be killed with 
SIGKILL.  Simply kill it with SIGTERM and the nodemanager should attempt to 
kill all containers before shutting down.  Killing the NM with SIGKILL is 
normally only done when performing a work-preserving restart on the NM, and 
that requires that yarn.nodemanager.recovery.enabled=true on that node to 
function properly.

> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>                 Key: YARN-4331
>                 URL: https://issues.apache.org/jira/browse/YARN-4331
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.7.1
>            Reporter: Joseph
>            Priority: Critical
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by 
> killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
> jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the 
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.

This message was sent by Atlassian JIRA
