[ 
https://issues.apache.org/jira/browse/YARN-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253956#comment-16253956
 ] 

Jason Lowe commented on YARN-2331:
----------------------------------

There is documentation of the property at 
https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml,
 but I agree it could be better.

bq. Is there a reason that you need to distinguish between a supervised NM 
shutdown and a rolling upgrade related shutdown?

Yes, in the sense that the two shutdowns may be different depending upon how 
the rolling upgrade shutdown was performed.  For example, in our clusters we do 
not have direct supervision of the nodemanagers and instead have another tool 
that periodically comes along and services nodes that have fallen out of the 
cluster.  That means the nodemanager will not necessarily be restarted in a 
timely manner if it crashes.  In that case we want the nodemanager to shut down 
cleanly during the crash, killing all running containers, since otherwise they 
will be unsupervised and the RM will believe the containers are dead due to 
lack of NM heartbeats from this node.  If the NM were under direct supervision 
then it would be restarted quickly after it crashed.  In that scenario we would 
_not_ want it to kill the containers and instead let the NM recover the 
containers upon restart.

For rolling upgrades we kill the nodemanager with SIGKILL, preventing it from 
doing any cleanup processing.  Then we restart the nodemanagers on the new 
software, and the nodemanager recovers the containers on startup.  In our 
clusters the work-preserving and supervised properties are set differently 
(recovery enabled, supervision disabled) so the NM knows to support recovery 
yet still kill containers on an orderly shutdown.  Before this change the NM 
would always kill containers on a shutdown, so it was impossible to preserve 
work in the case where the NM threw an exception and performed an orderly 
shutdown while under supervision.
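
As a rough illustration of the behavior described above, the decision made on 
an orderly shutdown can be sketched as a function of the two settings.  This is 
a minimal sketch, not code from the patch: the class and method names are 
hypothetical, the property names in the comments (believed to be 
yarn.nodemanager.recovery.enabled and yarn.nodemanager.recovery.supervised) are 
not named in this comment and should be checked against yarn-default.xml, and 
treating "recovery disabled" as "always kill" is an assumption.  A SIGKILL 
never reaches this logic at all, which is why containers survive a rolling 
upgrade even when supervision is off.

```java
// Hypothetical sketch of the orderly-shutdown decision; not the NM's actual code.
class NmShutdownSketch {

    /**
     * Returns true if an orderly NM shutdown should kill running containers.
     *
     * recoveryEnabled: work-preserving restart is on
     *   (assumed property: yarn.nodemanager.recovery.enabled)
     * supervised: the NM is restarted promptly by a supervisor
     *   (assumed property: yarn.nodemanager.recovery.supervised)
     */
    public static boolean killsContainersOnOrderlyShutdown(boolean recoveryEnabled,
                                                           boolean supervised) {
        // Containers are left running only when they can be recovered AND
        // a quick restart is expected; otherwise they would be orphaned.
        return !(recoveryEnabled && supervised);
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean recoveryEnabled : flags) {
            for (boolean supervised : flags) {
                System.out.printf("recovery=%b supervised=%b -> kill containers=%b%n",
                        recoveryEnabled, supervised,
                        killsContainersOnOrderlyShutdown(recoveryEnabled, supervised));
            }
        }
    }
}
```

The setup described in this comment corresponds to recovery enabled with 
supervision off: an orderly shutdown after a crash kills containers, while the 
SIGKILL used for rolling upgrades skips cleanup and leaves them for recovery.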

In 2.8 and later the nodemanager restart documentation moved to a unified 
nodemanager page, e.g.: 
https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-site/NodeManager.html,
 but it still doesn't describe this property.  I filed YARN-7502 to update the 
nodemanager restart docs to cover this property and when it would be useful.


> Distinguish shutdown during supervision vs. shutdown for rolling upgrade
> ------------------------------------------------------------------------
>
>                 Key: YARN-2331
>                 URL: https://issues.apache.org/jira/browse/YARN-2331
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>              Labels: BB2015-05-RFC
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: YARN-2331.patch, YARN-2331v2.patch, YARN-2331v3.patch
>
>
> When the NM is shutting down with restart support enabled there are scenarios 
> we'd like to distinguish and behave accordingly:
> # The NM is running under supervision.  In that case containers should be 
> preserved so the automatic restart can recover them.
> # The NM is not running under supervision and a rolling upgrade is not being 
> performed.  In that case the shutdown should kill all containers since it is 
> unlikely the NM will be restarted in a timely manner to recover them.
> # The NM is not running under supervision and a rolling upgrade is being 
> performed.  In that case the shutdown should not kill all containers since a 
> restart is imminent due to the rolling upgrade and the containers will be 
> recovered.
