[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Gong resolved YARN-3474.
----------------------------
Resolution: Invalid
> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
> Key: YARN-3474
> URL: https://issues.apache.org/jira/browse/YARN-3474
> Project: Hadoop YARN
> Issue Type: New Feature
> Affects Versions: 2.6.0
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3474.01.patch
>
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes
> active and recovers apps and attempts. Apps are not affected.
> However, some failures or bugs can prevent both RMs from starting
> normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or
> the RM cannot connect to ZK reliably). In that case the NM kills the
> containers running on it once it has been unable to heartbeat with the RM
> for some time (the max retry time is 15 minutes by default), and all apps
> are then killed.
> In a production cluster we may hit such cases, and fixing the underlying
> bugs may take longer than 15 minutes. To keep apps from being affected and
> killed by the NM, a YARN admin could set a flag (in our solution, the znode
> '/wait-rm-to-come-back/cluster-id') that tells the NM to wait for the RM to
> come back instead of killing running containers. Once the bugs are fixed and
> the RM starts normally, the admin clears the flag.
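
The decision described above can be sketched roughly as follows. This is only an illustration, not code from YARN-3474.01.patch: the class `WaitForRmPolicy` and the method `shouldKillContainers` are hypothetical names, the znode lookup is reduced to a plain boolean, and treating the retry window as expired when the elapsed time is equal to or greater than the limit is an assumption.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hadoop: models the NM-side decision
// proposed in the description above.
final class WaitForRmPolicy {
    // Default max retry window mentioned in the description (15 minutes).
    static final Duration DEFAULT_MAX_RETRY = Duration.ofMinutes(15);

    /**
     * @param sinceLastHeartbeat time since the NM last reached an RM
     * @param maxRetry           how long the NM keeps retrying before giving up
     * @param waitFlagSet        assumed to reflect whether the admin's flag
     *                           (the znode in the description) currently exists
     */
    static boolean shouldKillContainers(Duration sinceLastHeartbeat,
                                        Duration maxRetry,
                                        boolean waitFlagSet) {
        if (waitFlagSet) {
            // Admin asked the NM to wait for the RM: keep containers running.
            return false;
        }
        // No flag: fall back to the timeout behaviour (assumed inclusive bound).
        return sinceLastHeartbeat.compareTo(maxRetry) >= 0;
    }

    public static void main(String[] args) {
        Duration outage = Duration.ofMinutes(20);
        System.out.println(shouldKillContainers(outage, DEFAULT_MAX_RETRY, false));
        System.out.println(shouldKillContainers(outage, DEFAULT_MAX_RETRY, true));
    }
}
```

With a 20-minute outage and the 15-minute default, the first call reports that containers would be killed and the second, with the flag set, reports that they would be kept.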