[
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530118#comment-15530118
]
Daniel Templeton commented on YARN-5677:
----------------------------------------
Just tested the leader election (with the right property enabled), and it works
as advertised. The active becomes standby by the time the standby becomes
active.
> RM can be in active-active state for an extended period
> -------------------------------------------------------
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.3, 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
>
> Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses
> contact with the ZK node(s).
> In branch-2.7, the RM will retry the connection 1000 times by default.
> Attempting to contact a node which cannot be reached is slow, which means the
> active can take over an hour to realize it is no longer active. I clocked it
> at about an hour and a half in my tests. The solution appears to be to add
> some time awareness into the retry loop.
> In branch-2.8/trunk, there is no maximum number of retries that I see. It
> appears the connection will be retried forever, with the active never
> figuring out it's no longer active. In my testing, the active-active state
> lasted almost 2 hours with no sign of stopping before I killed it. The
> solution appears to be to cap the number of retries or amount of time spent
> retrying.
> This issue is significant because of the asynchronous nature of job
> submission. If the active doesn't know it's not active, it will buffer up
> job submissions until it finally realizes it has become the standby. Then it
> will fail all the job submissions in bulk. In high-volume workflows, that
> behavior can create huge mass job failures.
> This issue is also important because the node managers will not fail over to
> the new active until the old active realizes it's the standby. Workloads
> submitted after the old active loses contact with ZK will therefore fail to
> be executed regardless of which RM the clients contact.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]