[jira] [Created] (YARN-11184) fenced active RM not failing over correctly in HA setup

Steven Rand (Jira) Tue, 14 Jun 2022 13:47:07 -0700

Steven Rand created YARN-11184:
----------------------------------

             Summary: fenced active RM not failing over correctly in HA setup
                 Key: YARN-11184
                 URL: https://issues.apache.org/jira/browse/YARN-11184
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.2.3
            Reporter: Steven Rand
         Attachments: image-2022-06-14-16-38-00-336.png, 
image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, 
image-2022-06-14-16-44-45-101.png

We've observed an issue recently on a production cluster running 3.2.3 in which
a fenced Resource Manager remains active, but does not communicate with the ZK
state store, and therefore cannot function correctly. This did not occur while
running 3.2.2 on the same cluster.

In more detail, what seems to happen is:

1. The active RM gets a {{NodeExists}} error from ZK while storing an app in
the state store. I suspect that this is caused by a transient connection
issue: the first node creation request succeeds, but its response never
reaches the RM, so the RM sends a duplicate request, which fails with this
error.

!image-2022-06-14-16-38-00-336.png!

2. Because of this error, the active RM is fenced.

!image-2022-06-14-16-39-50-278.png!

3. Because it is fenced, the active RM starts to transition to standby.

!image-2022-06-14-16-41-39-742.png!

4. However, the RM never fully transitions to standby. It never logs
{{Transitioning RM to Standby mode}} from the run method of
{{StandByTransitionRunnable}}:
[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195].
Related to this, a jstack of the RM shows that thread being {{RUNNABLE}}, but
evidently not making progress:

!image-2022-06-14-16-44-45-101.png!

So the RM doesn't work because it is fenced, but remains active, which causes
an outage until a failover is manually initiated.
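
The hypothesis in step 1 can be illustrated with a small, self-contained sketch. It does not use the real ZooKeeper client or any RM code: {{FlakyStore}}, both exception classes, {{createWithRetry}}, and the path {{/example/app_1}} are hypothetical stand-ins invented for this illustration. It only shows how a create that succeeds on the server but whose response is lost can make a naive retry fail with a "node exists" error.

```java
import java.util.HashSet;
import java.util.Set;

public class LostResponseSketch {

    /** Hypothetical stand-in for a "node already exists" error. */
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) { super("NodeExists: " + path); }
    }

    /** Hypothetical stand-in for a response lost to a connection problem. */
    static class ConnectionLossException extends Exception {
        ConnectionLossException(String path) { super("ConnectionLoss: " + path); }
    }

    /** In-memory store that applies the first create but drops its response. */
    static class FlakyStore {
        private final Set<String> nodes = new HashSet<>();
        private boolean dropNextResponse = true;

        void create(String path) throws NodeExistsException, ConnectionLossException {
            // The write is applied (or rejected) on the "server" side first.
            if (!nodes.add(path)) {
                throw new NodeExistsException(path);
            }
            // Only the response is lost; the node has already been created.
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new ConnectionLossException(path);
            }
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }
    }

    /** Naive retry: resend on connection loss, report NodeExists as a failure. */
    static String createWithRetry(FlakyStore store, String path) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                store.create(path);
                return "created on attempt " + attempt;
            } catch (ConnectionLossException e) {
                // The client cannot tell whether the write was applied, so it retries.
            } catch (NodeExistsException e) {
                return "NodeExists on attempt " + attempt;
            }
        }
        return "gave up";
    }

    public static void main(String[] args) {
        FlakyStore store = new FlakyStore();
        System.out.println(createWithRetry(store, "/example/app_1")); // NodeExists on attempt 2
        System.out.println(store.exists("/example/app_1"));           // true
    }
}
```

In this sketch the node exists after the first attempt even though the caller only ever sees errors, which matches the suspected sequence in step 1. Whether this is what actually happened on the cluster is not confirmed by the logs above.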
