[ https://issues.apache.org/jira/browse/YARN-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253956#comment-16253956 ]
Jason Lowe commented on YARN-2331:
----------------------------------

There is documentation of the property at https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, but I agree it could be better.

bq. Is there a reason that you need to distinguish between a supervised NM shutdown and a rolling upgrade related shutdown?

Yes, in the sense that the two shutdowns may differ depending upon how the rolling upgrade shutdown was performed.

For example, in our clusters we do not have direct supervision on the nodemanagers and instead have another tool that periodically comes along and services nodes that have fallen out of the cluster. That means the nodemanager will not necessarily be restarted in a timely manner if it crashes. In that case we want the nodemanager to shut down cleanly during the crash, killing all running containers, since otherwise they will be unsupervised and the RM will believe the containers are dead due to lack of NM heartbeats from this node.

If the NM were under direct supervision then it would be restarted quickly after it crashes. In that scenario we would _not_ want it to kill the containers and would instead let the NM recover the containers upon restart.

For rolling upgrades we kill the nodemanager with SIGKILL, preventing it from doing any cleanup processing. Then we restart the nodemanagers on the new software, and the nodemanager recovers the containers on startup. In our clusters the work-preserving and supervised properties are set differently, so the NM knows to support recovery yet still kill containers on shutdown.

Before this change the NM would always kill containers on a shutdown, so it was impossible to preserve work in the case where the NM threw an exception and performed an orderly shutdown while under supervision.
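To make the cases above concrete, here is a minimal sketch of the shutdown decision as I read it from this comment. It is not the NodeManager code. The class and function names are hypothetical, and the mapping of the two flags to yarn.nodemanager.recovery.enabled and yarn.nodemanager.recovery.supervised is an assumption that should be checked against yarn-default.xml. Other factors the real NM may consider are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NMShutdownContext:
    """Hypothetical model of the inputs to the shutdown decision."""
    # Assumed to mirror yarn.nodemanager.recovery.enabled (work-preserving restart).
    recovery_enabled: bool
    # Assumed to mirror yarn.nodemanager.recovery.supervised.
    supervised: bool
    # False models SIGKILL: the process dies before any cleanup code can run.
    orderly_shutdown: bool


def containers_killed_on_shutdown(ctx: NMShutdownContext) -> bool:
    """Return True if the NM would kill its running containers while stopping."""
    if not ctx.orderly_shutdown:
        # Rolling upgrade via SIGKILL: no cleanup runs, so containers survive
        # and can be recovered when the NM restarts on the new software.
        return False
    if ctx.recovery_enabled and ctx.supervised:
        # Supervised NM: a quick restart is expected, so preserve the work.
        return False
    # Unsupervised NM (or recovery disabled): a timely restart is unlikely,
    # so kill containers rather than leave them running unmonitored.
    return True


if __name__ == "__main__":
    cases = {
        "supervised, orderly shutdown": NMShutdownContext(True, True, True),
        "unsupervised, orderly shutdown": NMShutdownContext(True, False, True),
        "unsupervised, SIGKILL for upgrade": NMShutdownContext(True, False, False),
    }
    for name, ctx in cases.items():
        print(f"{name}: kill containers = {containers_killed_on_shutdown(ctx)}")
```

The third case is why the unsupervised setup described above still preserves work across a rolling upgrade: the decision to kill containers is never reached because the process is killed outright.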

In 2.8 and later the nodemanager restart documentation moved to a unified nodemanager page, e.g.: https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-site/NodeManager.html, but it still doesn't describe this property. I filed YARN-7502 to update the nodemanager restart docs to cover this property and when it would be useful.

> Distinguish shutdown during supervision vs. shutdown for rolling upgrade
> ------------------------------------------------------------------------
>
>                 Key: YARN-2331
>                 URL: https://issues.apache.org/jira/browse/YARN-2331
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>              Labels: BB2015-05-RFC
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: YARN-2331.patch, YARN-2331v2.patch, YARN-2331v3.patch
>
>
> When the NM is shutting down with restart support enabled there are scenarios
> we'd like to distinguish and behave accordingly:
> # The NM is running under supervision. In that case containers should be
> preserved so the automatic restart can recover them.
> # The NM is not running under supervision and a rolling upgrade is not being
> performed. In that case the shutdown should kill all containers since it is
> unlikely the NM will be restarted in a timely manner to recover them.
> # The NM is not running under supervision and a rolling upgrade is being
> performed. In that case the shutdown should not kill all containers since a
> restart is imminent due to the rolling upgrade and the containers will be
> recovered.