[jira] [Commented] (YARN-11184) fenced active RM not failing over correctly in HA setup

Steven Rand (Jira) Tue, 14 Jun 2022 15:51:08 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554318#comment-17554318
 ]


Steven Rand commented on YARN-11184:
------------------------------------

Possibly [ZOOKEEPER-2251|https://issues.apache.org/jira/browse/ZOOKEEPER-2251] 
is related? The thread dump is different, but it appears to be a similar 
problem of the {{StandByTransitionThread}} waiting indefinitely for a response. 
The ZK version that Hadoop uses on the client side does not include the fix for 
that issue.

> fenced active RM not failing over correctly in HA setup
> -------------------------------------------------------
>
>                 Key: YARN-11184
>                 URL: https://issues.apache.org/jira/browse/YARN-11184
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.3
>            Reporter: Steven Rand
>            Priority: Major
>         Attachments: image-2022-06-14-16-38-00-336.png, 
> image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, 
> image-2022-06-14-16-44-45-101.png
>
>
> We've observed an issue recently on a production cluster running 3.2.3 in 
> which a fenced Resource Manager remains active, but does not communicate with 
> the ZK state store, and therefore cannot function correctly. This did not 
> occur while running 3.2.2 on the same cluster.
> In more detail, what seems to happen is: 
> 1. The active RM gets a {{NodeExists}} error from ZK while storing an app in 
> the state store. I suspect that a transient connection issue lets the first 
> node creation request succeed without the response reaching the RM, which 
> triggers a duplicate request that fails with this error.
> !image-2022-06-14-16-38-00-336.png!
> 2. Because of this error, the active RM is fenced.
> !image-2022-06-14-16-39-50-278.png!
> 3. Because it is fenced, the active RM starts to transition to standby.
> !image-2022-06-14-16-41-39-742.png!
> 4. However, the RM never fully transitions to standby. It never logs 
> {{Transitioning RM to Standby mode}} from the run method of 
> {{StandByTransitionRunnable}}: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195].
>  Related to this, a jstack of the RM shows that thread being {{RUNNABLE}}, 
> but evidently not making progress:
>  !image-2022-06-14-16-44-45-101.png! 
> So the RM doesn't work because it is fenced, but remains active, which causes 
> an outage until a failover is manually initiated.
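
The mechanism suspected in step 1 of the description (a create that succeeds server side, a lost response, then a retry that fails with {{NodeExists}}) can be illustrated with a small simulation. This is only a sketch under stated assumptions: {{FakeStore}}, {{NodeExistsError}}, {{store_app}}, and the path {{/rmstore/app_1}} are hypothetical stand-ins, not the ZooKeeper client API or the RM state store code, and the retry policy shown is an assumption rather than what Hadoop actually does.

```python
class NodeExistsError(Exception):
    """Hypothetical stand-in for ZooKeeper's NodeExists error."""


class FakeStore:
    """In-memory stand-in for the ZK state store (not a real ZooKeeper API)."""

    def __init__(self, drop_first_response=False):
        self.nodes = {}
        self.drop_next_response = drop_first_response

    def create(self, path, data):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = data  # the write is applied on the "server"
        if self.drop_next_response:
            self.drop_next_response = False
            # Simulate the response being lost: the client never sees success.
            raise ConnectionError("response lost")


def store_app(store, path, data, retries=1):
    """Naive retry (assumed policy): re-send the create after a connection error."""
    for _ in range(retries + 1):
        try:
            store.create(path, data)
            return "stored"
        except ConnectionError:
            continue  # retry, unaware that the first create was applied
        except NodeExistsError:
            return "node-exists"  # the kind of error that led to fencing in the report
    return "gave-up"


def demo():
    flaky = FakeStore(drop_first_response=True)
    healthy = FakeStore()
    return (
        store_app(flaky, "/rmstore/app_1", b"state"),
        store_app(healthy, "/rmstore/app_1", b"state"),
    )
```

With the flaky store, the duplicate request hits a node that the first request already created, so the caller sees a node-exists failure even though the data was stored. With the healthy store, the same call succeeds on the first attempt.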



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
