[ 
https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501037#comment-14501037
 ] 

Jun Gong commented on YARN-3474:
--------------------------------

Any comments are appreciated.

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3474.01.patch
>
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes 
> active and recovers apps and attempts, so apps are not affected. 
> However, some failures or bugs can prevent both RMs from starting 
> normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or the RM 
> cannot connect to ZK reliably). In that case the NM kills the containers running 
> on it once it has been unable to heartbeat to the RM for some time (the max 
> retry time is 15 mins by default), and all apps are then killed. 
> In a production cluster we might hit such cases, and fixing these bugs 
> might take more than 15 mins. To keep apps from being affected and 
> killed by the NM, a YARN admin could set a flag (in our solution, the znode 
> '/wait-rm-to-come-back/cluster-id') to tell the NM to wait for the RM to 
> come back and not kill running containers. After the bugs are fixed and the RM 
> starts normally, the admin clears the flag.
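
To make the proposal above concrete, here is a minimal sketch of the decision an NM might make once its RM retry window has expired. It is not the code in YARN-3474.01.patch: the class RmOutagePolicy, its Action enum, and the method names are hypothetical, and the znode lookup is abstracted behind a Predicate (an assumption standing in for a ZooKeeper exists check) so that the sketch runs without a ZooKeeper client. Only the flag path and the 15-minute default are taken from the description.

```java
import java.time.Duration;
import java.util.function.Predicate;

/** Hypothetical sketch of the proposed behaviour; not taken from the attached patch. */
final class RmOutagePolicy {
    enum Action { KEEP_RETRYING, WAIT_FOR_RM, KILL_CONTAINERS }

    // Parent path of the flag znode, as given in the issue description.
    static final String FLAG_ROOT = "/wait-rm-to-come-back";

    private final Duration maxRetry;              // 15 minutes by default, per the description
    private final String clusterId;
    private final Predicate<String> znodeExists;  // assumption: wraps a ZooKeeper existence check

    RmOutagePolicy(Duration maxRetry, String clusterId, Predicate<String> znodeExists) {
        this.maxRetry = maxRetry;
        this.clusterId = clusterId;
        this.znodeExists = znodeExists;
    }

    /** Full path of the flag znode for this cluster. */
    String flagPath() {
        return FLAG_ROOT + "/" + clusterId;
    }

    /** Decide what the NM does after being out of contact with the RM for the given time. */
    Action decide(Duration sinceLastHeartbeat) {
        if (sinceLastHeartbeat.compareTo(maxRetry) < 0) {
            return Action.KEEP_RETRYING;          // still inside the normal retry window
        }
        // Retry window exhausted: spare the containers only if the admin has set the flag.
        return znodeExists.test(flagPath()) ? Action.WAIT_FOR_RM : Action.KILL_CONTAINERS;
    }
}
```

With the flag absent the sketch behaves as the NM does today (containers are killed after the retry window); the admin creating the znode switches the outcome to waiting, and removing it restores the default.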



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
