Daniel Templeton created YARN-5677:
--------------------------------------
Summary: RM can be in active-active state for an extended period
Key: YARN-5677
URL: https://issues.apache.org/jira/browse/YARN-5677
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.0.0-alpha1, 2.7.3
Reporter: Daniel Templeton
Assignee: Daniel Templeton
Priority: Critical
Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses
contact with the ZK node(s).
In branch-2.7, the RM retries the connection 1000 times by default. Each
attempt to contact an unreachable node is slow, so the active can take over
an hour to realize it is no longer active. I clocked it at about an hour and
a half in my tests. The solution appears to be to make the retry loop
time-aware, so that it gives up based on elapsed time and not only on the
attempt count.
In branch-2.8/trunk, I see no maximum number of retries. The connection
appears to be retried forever, so the active never figures out that it is no
longer active. I have a test running, and I'll update this description with
empirical findings when I'm done. The solution appears to be to cap the
number of retries or the amount of time spent retrying.
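To make the proposed fix concrete, here is a minimal sketch of a retry loop
capped by both attempt count and elapsed time. It is not the RM's actual
ZooKeeper retry code; BoundedRetry, callWithBudget, RetriesExhaustedException,
and all parameter names are hypothetical and chosen only for illustration.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper: not part of Hadoop or ZooKeeper. */
public class BoundedRetry {

    /** Signals that the attempt cap or the time cap was reached. */
    public static class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs op until it succeeds, maxAttempts is reached, or maxElapsed has
     * passed, whichever comes first. The deadline is checked between
     * attempts, so a single blocking call can still overrun it; the
     * operation needs its own timeout to bound that.
     */
    public static <T> T callWithBudget(Callable<T> op, int maxAttempts,
            Duration maxElapsed, Duration pause)
            throws RetriesExhaustedException, InterruptedException {
        long deadline = System.nanoTime() + maxElapsed.toNanos();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException ie) {
                throw ie; // never swallow interruption
            } catch (Exception e) {
                last = e; // remember the failure and consider retrying
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                break; // time budget spent: stop even if attempts remain
            }
            if (attempt < maxAttempts) {
                // Sleep between attempts, but never past the deadline.
                Thread.sleep(Math.min(pause.toMillis(),
                        remainingNanos / 1_000_000));
            }
        }
        throw new RetriesExhaustedException(
                "retry budget exhausted; caller should stop acting as active",
                last);
    }
}
```

With a loop of this shape, the worst-case time before the active gives up is
roughly maxElapsed plus the duration of one attempt, no matter how slow each
attempt is.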
This issue is significant because job submission is asynchronous. If the
active doesn't know it is no longer active, it will buffer up job submissions
until it finally realizes it has become the standby, and then fail all of
them in bulk. In high-volume workflows, that behavior can cause mass job
failures.