[
https://issues.apache.org/jira/browse/YARN-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554318#comment-17554318
]
Steven Rand commented on YARN-11184:
------------------------------------
Possibly [ZOOKEEPER-2251|https://issues.apache.org/jira/browse/ZOOKEEPER-2251]
is related? The thread dump there is different, but it appears to be a similar
problem: the {{StandByTransitionThread}} waits indefinitely for a response.
The ZooKeeper client version used by Hadoop does not include the fix for that
issue.
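
For illustration only, here is a minimal sketch of the general hazard described above: a thread that waits on a response with no upper bound, and a caller that bounds the wait instead. This is not Hadoop or ZooKeeper code. The class name, method name, and thread name are hypothetical, and the "stuck" task is simulated with a latch that is never released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWaitSketch {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns true if the task finished in time, false if the wait timed out.
     */
    static boolean runWithTimeout(Runnable task, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            // Hypothetical thread name; daemon so a stuck task cannot block JVM exit.
            Thread t = new Thread(r, "hypothetical-transition-thread");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the task responds to interruption.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates a response that never arrives.
        CountDownLatch neverReleased = new CountDownLatch(1);
        boolean stuckFinished = runWithTimeout(() -> {
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 200);
        boolean quickFinished = runWithTimeout(() -> { }, 200);
        System.out.println("stuck finished=" + stuckFinished
                + ", quick finished=" + quickFinished);
    }
}
```

In this sketch the caller regains control once the timeout expires. Whether a comparable bound applies to the RM's ZK client is the open question raised above.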
> fenced active RM not failing over correctly in HA setup
> -------------------------------------------------------
>
> Key: YARN-11184
> URL: https://issues.apache.org/jira/browse/YARN-11184
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.3
> Reporter: Steven Rand
> Priority: Major
> Attachments: image-2022-06-14-16-38-00-336.png,
> image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png,
> image-2022-06-14-16-44-45-101.png
>
>
> We've observed an issue recently on a production cluster running 3.2.3 in
> which a fenced Resource Manager remains active, but does not communicate with
> the ZK state store, and therefore cannot function correctly. This did not
> occur while running 3.2.2 on the same cluster.
> In more detail, what seems to happen is:
> 1. The active RM gets a {{NodeExists}} error from ZK while storing an app in
> the state store. I suspect that this is caused by a transient connection
> issue in which the first node creation request succeeds but its response
> never reaches the RM, triggering a duplicate request that fails with this
> error.
> !image-2022-06-14-16-38-00-336.png!
> 2. Because of this error, the active RM is fenced.
> !image-2022-06-14-16-39-50-278.png!
> 3. Because it is fenced, the active RM starts to transition to standby.
> !image-2022-06-14-16-41-39-742.png!
> 4. However, the RM never fully transitions to standby. It never logs
> {{Transitioning RM to Standby mode}} from the run method of
> {{StandByTransitionRunnable}}:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195].
> Related to this, a jstack of the RM shows that thread being {{RUNNABLE}},
> but evidently not making progress:
> !image-2022-06-14-16-44-45-101.png!
> So the RM doesn't work because it is fenced, but remains active, which causes
> an outage until a failover is manually initiated.
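
The sequence suspected in step 1 of the description (the first create succeeds, its response is lost, and the retry fails with {{NodeExists}}) can be sketched as a small simulation. This is only an illustration of that hypothesis, not the RM's state store code and not the ZooKeeper client API. {{InMemoryStore}}, {{FlakyConnection}}, {{NodeExistsException}}, and the path {{/app_1}} are hypothetical stand-ins.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

final class LostResponseSketch {

    /** Hypothetical stand-in for a "node already exists" error. */
    static final class NodeExistsException extends Exception {
        NodeExistsException(String path) {
            super("NodeExists: " + path);
        }
    }

    /** Hypothetical in-memory store: create() fails if the path already exists. */
    static final class InMemoryStore {
        private final Set<String> nodes = new HashSet<>();

        void create(String path) throws NodeExistsException {
            if (!nodes.add(path)) {
                throw new NodeExistsException(path);
            }
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }
    }

    /** Applies every request to the store, but drops the response to the first one. */
    static final class FlakyConnection {
        private final InMemoryStore store;
        private boolean dropNextResponse = true;

        FlakyConnection(InMemoryStore store) {
            this.store = store;
        }

        void create(String path) throws NodeExistsException, IOException {
            store.create(path); // the server side applies the request
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new IOException("response lost"); // the client never sees success
            }
        }
    }

    /** Naive client: retries once after a connection error and reports the outcome. */
    static String createWithNaiveRetry(FlakyConnection conn, String path) {
        try {
            conn.create(path);
            return "created";
        } catch (IOException lostResponse) {
            try {
                conn.create(path); // duplicate of a request that already succeeded
                return "created on retry";
            } catch (NodeExistsException e) {
                return "NodeExists on retry";
            } catch (IOException e) {
                return "connection failed twice";
            }
        } catch (NodeExistsException e) {
            return "NodeExists on first attempt";
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        FlakyConnection conn = new FlakyConnection(store);
        System.out.println(createWithNaiveRetry(conn, "/app_1"));
        System.out.println("node exists: " + store.exists("/app_1"));
    }
}
```

In the simulation the node is present in the store even though the client only ever observes an error. That would match the hypothesis in step 1, though it says nothing about why the later transition to standby stalls.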
--
This message was sent by Atlassian Jira
(v8.20.7#820007)