[
https://issues.apache.org/jira/browse/YARN-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Nauroth resolved YARN-11839.
----------------------------------
Fix Version/s: 3.3.7, 3.5.0, 3.4.3
Hadoop Flags: Reviewed
Resolution: Fixed
> [RM HA] - In corner case, RM stay in ACTIVE with RMStateStore in FENCED state
> -----------------------------------------------------------------------------
>
> Key: YARN-11839
> URL: https://issues.apache.org/jira/browse/YARN-11839
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.3.6, 3.4.2
> Reporter: Vinayakumar B
> Assignee: Vinayakumar B
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.3.7, 3.5.0, 3.4.3
>
>
> In a corner case involving the following sequence of events, the RM stays
> ACTIVE while the RMStateStore is in the FENCED state.
> # Initially, the RM is in the ACTIVE state.
> # An event triggers `transitionToStandby()` on the RM.
> # During `reinitialize(true)` in the RM, the CapacityScheduler is created but
> not yet initialized.
> # Another `transitionToActive()` is triggered by a ZK re-election. It calls
> `reinitialize()` on the CapacityScheduler, which throws a
> `NullPointerException` and in turn generates
> `RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED`.
> # This triggers the `StandByTransitionRunnable`, which sets the flag
> `hasAlreadyRun=true` even though the RM is already STANDBY at this stage.
> # This state persists for some time.
> # Later, the RM becomes active after a re-election, but
> `StandByTransitionRunnable#hasAlreadyRun` is still true.
> # ZK becomes unstable; the RMStateStore hits a ZK error and moves to the
> *FENCED* state.
> # This again triggers the `StandByTransitionRunnable`.
> # Because the flag is already set, `StandByTransitionRunnable` exits silently.
> # The RM stays *ACTIVE* with the RMStateStore in the *FENCED* state.
> # All new applications remain in the *NEW_SAVING* state, and no application
> undergoes any further state change.
>
>
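The stale-flag sequence described above can be illustrated with a small toy model. This is only a sketch, not the actual ResourceManager code and not necessarily the fix that was committed: `ToyRm`, `simulate`, and the `resetFlagOnActive` switch are hypothetical names introduced for illustration. Only the `hasAlreadyRun` flag and the ACTIVE/STANDBY/FENCED states come from the description. Resetting the flag on the transition to active is shown as one possible remedy, assumed here purely for comparison.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the reported sequence; NOT the real ResourceManager code. */
public class StaleFlagDemo {
    enum HaState { ACTIVE, STANDBY }

    /** Hypothetical stand-in for an RM that owns a run-once standby runnable. */
    static class ToyRm {
        HaState haState = HaState.ACTIVE;
        boolean storeFenced = false;
        // Assumption: one possible remedy, used only to contrast the two outcomes.
        final boolean resetFlagOnActive;
        final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

        ToyRm(boolean resetFlagOnActive) {
            this.resetFlagOnActive = resetFlagOnActive;
        }

        void transitionToStandby() {
            haState = HaState.STANDBY;
        }

        void transitionToActive() {
            if (resetFlagOnActive) {
                hasAlreadyRun.set(false);
            }
            haState = HaState.ACTIVE;
            storeFenced = false;
        }

        /** Models the run-once runnable: a second call returns without acting. */
        void runStandbyTransition() {
            if (hasAlreadyRun.getAndSet(true)) {
                return; // silent exit, as in step 10
            }
            transitionToStandby();
        }

        /** Models the state store hitting a ZK error and becoming fenced. */
        void onStoreFenced() {
            storeFenced = true;
            runStandbyTransition();
        }
    }

    /** Replays the reported steps against the toy model and returns the final state. */
    static ToyRm simulate(boolean resetFlagOnActive) {
        ToyRm rm = new ToyRm(resetFlagOnActive);
        rm.transitionToStandby();   // step 2
        rm.runStandbyTransition();  // steps 4-5: runnable fires while already STANDBY
        rm.transitionToActive();    // step 7: later re-election succeeds
        rm.onStoreFenced();         // steps 8-10: store is fenced, runnable fires again
        return rm;
    }

    public static void main(String[] args) {
        ToyRm stale = simulate(false);
        System.out.println("stale flag: " + stale.haState + ", fenced=" + stale.storeFenced);
        ToyRm reset = simulate(true);
        System.out.println("reset flag: " + reset.haState + ", fenced=" + reset.storeFenced);
    }
}
```

With the stale flag, the toy RM ends up ACTIVE with a fenced store, matching step 11. With the flag cleared on the transition to active, the same fencing event moves it to STANDBY.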
--
This message was sent by Atlassian Jira
(v8.20.10#820010)