[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Gong resolved YARN-3474.
----------------------------
    Resolution: Invalid

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3474.01.patch
>
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes
> active and recovers apps and attempts. Apps are not affected.
> However, some conditions or bugs can prevent both RMs from starting
> normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340],
> or the RM cannot connect to ZK reliably). In that case the NM kills the
> containers running on it once it has been unable to heartbeat with the RM
> for some time (the maximum retry time is 15 minutes by default). All apps
> are then killed.
> In a production cluster we might run into such cases, and fixing these bugs
> might take longer than 15 minutes. To keep apps from being affected and
> killed by the NM, a YARN admin could set a flag (in our solution, the flag
> is a znode '/wait-rm-to-come-back/cluster-id') that tells the NM to wait for
> the RM to come back and not kill running containers. After the bugs are
> fixed and the RM starts normally, the admin clears the flag.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)