[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547155#comment-14547155
 ] 

sandflee commented on YARN-3644:
--------------------------------

[~raju.bairishetti] thanks for your reply. If RM HA is not enabled, we can fix 
it like this. But with RM HA, there are several conditions to consider:
1. Both RM A and RM B reset the connection: the RMs appear to be in trouble, 
so the NM keeps its containers alive.
2. Both RM A and RM B socket-timeout: either the NM is network-partitioned 
from the RMs or both RM machines have crashed (is there any way to distinguish 
these?), so the NM kills all containers.
3. One RM resets the connection and the other socket-times-out: this is 
difficult to handle, since we know nothing about the active RM. Both RMs may 
have crashed, or only the active RM may be network-partitioned.
I suggest the backup RM also respond and tell the NM "I'm the backup RM". 
Case 3 then becomes:
3.1 One RM resets the connection and the other socket-times-out: the RMs 
appear to be in trouble, so keep containers alive.
3.2 One RM answers as the backup and the other socket-times-out: the NM 
appears to be network-partitioned from the active RM, so kill all containers.
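The case analysis above can be sketched as a small decision table. This is a 
hypothetical illustration, not YARN code: RmProbePolicy, Probe, and Action are 
invented names, and it assumes the proposed behavior where the backup RM 
answers probes instead of timing out.

```java
// Hypothetical sketch of the NM-side decision, assuming the proposal that the
// backup RM responds to probes. Class and enum names are invented for
// illustration; none of this is the actual YARN API.
public class RmProbePolicy {
    // Result of one probe attempt against a single RM.
    enum Probe { RESET, TIMEOUT, BACKUP }

    // What the NM should do with its running containers.
    enum Action { KEEP_CONTAINERS, KILL_CONTAINERS }

    static Action decide(Probe a, Probe b) {
        // Case 2: both RMs time out -> NM is likely partitioned from the RMs
        // (or both RM machines crashed), so kill all containers.
        if (a == Probe.TIMEOUT && b == Probe.TIMEOUT) {
            return Action.KILL_CONTAINERS;
        }
        // Case 3.2: one RM identifies itself as the backup and the other times
        // out -> NM is likely partitioned from the active RM, so kill.
        if ((a == Probe.BACKUP && b == Probe.TIMEOUT)
                || (a == Probe.TIMEOUT && b == Probe.BACKUP)) {
            return Action.KILL_CONTAINERS;
        }
        // Cases 1 and 3.1: at least one RM actively resets the connection, so
        // the network path works and the RMs themselves are in trouble -> the
        // NM keeps its containers alive.
        return Action.KEEP_CONTAINERS;
    }
}
```

The key point the table captures is that a connection reset proves the network 
path is healthy, while a timeout proves nothing, so only the backup RM's 
explicit answer lets the NM tell an RM outage from its own partition.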

> Node manager shuts down if unable to connect with RM
> ----------------------------------------------------
>
>                 Key: YARN-3644
>                 URL: https://issues.apache.org/jira/browse/YARN-3644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Srikanth Sundarrajan
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>           } catch (ConnectException e) {
>             // catch and rethrow the exception after trying for the MAX wait
>             // time to connect to the RM
>             dispatcher.getEventHandler().handle(
>                 new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>             throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a long period, all 
> the NMs shut themselves down, requiring additional work to bring them back 
> up.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects: 
> non-connection failures are retried infinitely by all YarnClients (via 
> RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
