Vinayakumar B created YARN-11839:
------------------------------------

Summary: [RM HA] - In corner case, RM stay in ACTIVE with RMStateStore in FENCED state
Key: YARN-11839
URL: https://issues.apache.org/jira/browse/YARN-11839
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Vinayakumar B

In a corner case involving the following sequence of events, the RM stays ACTIVE while the RMStateStore is in the FENCED state:

1. Initially, the RM is in the ACTIVE state.
2. An event triggers `transitionToStandby()` on the RM.
3. During `reinitialize(true)` in the RM, the CapacityScheduler has been created but not yet initialized.
4. Another `transitionToActive()` command is triggered from the admin CLI, which calls `reinitialize()` on the CapacityScheduler. This results in a `NullPointerException`, which in turn generates `RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED`.
5. This triggers the `StandByTransitionRunnable`, which sets the flag `hasAlreadyRun=true` even though the RM is already in STANDBY at this stage.
6. This state continues for some time.
7. Later, the RM becomes active again after re-election, but `StandByTransitionRunnable#hasAlreadyRun` is still true.
8. Because ZK is unstable, the RMStateStore hits a ZK error and goes to the FENCED state.
9. This again triggers the `StandByTransitionRunnable`.
10. Because of the flag, the `StandByTransitionRunnable` exits silently.
11. The RM continues to stay ACTIVE with the RMStateStore in the FENCED state.
12. All new applications remain in the NEW_SAVING state, and no application makes any further state change.
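
The sequence above can be modeled with a small, self-contained sketch. This is not the ResourceManager code: `FencedStateSketch`, `MiniRm`, and all of its methods are hypothetical names invented for illustration, and the only detail taken from the report is a one-shot `hasAlreadyRun` flag that is set on the first run of the standby runnable and never cleared when the RM becomes active again. The model assumes the flag is an atomic boolean and does not attempt to model state-store re-creation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the reported sequence.
// None of these types exist in Hadoop; they only mirror the steps in the report.
class FencedStateSketch {
    enum HaState { ACTIVE, STANDBY }
    enum StoreState { ACTIVE, FENCED }

    static class MiniRm {
        HaState haState = HaState.ACTIVE;
        StoreState storeState = StoreState.ACTIVE;
        // One-shot guard, modeled on StandByTransitionRunnable#hasAlreadyRun.
        final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

        void transitionToStandby() { haState = HaState.STANDBY; }

        // In this model the flag is NOT reset when the RM becomes active again.
        void transitionToActive() { haState = HaState.ACTIVE; }

        // Stand-in for the standby runnable: does its work at most once.
        void runStandByTransition() {
            if (hasAlreadyRun.getAndSet(true)) {
                return; // exits silently on every later invocation
            }
            transitionToStandby();
        }

        // Stand-in for the state store hitting a ZK error.
        void fenceStore() {
            storeState = StoreState.FENCED;
            runStandByTransition();
        }
    }

    // Replays the steps; the argument selects whether the failed
    // transitionToActive (steps 4 and 5) happened while already in STANDBY.
    static String simulate(boolean failedTransitionWhileStandby) {
        MiniRm rm = new MiniRm();
        rm.transitionToStandby();          // step 2
        if (failedTransitionWhileStandby) {
            rm.runStandByTransition();     // steps 4-5: flag set while already STANDBY
        }
        rm.transitionToActive();           // step 7: re-election, flag still set
        rm.fenceStore();                   // steps 8-10
        return rm.haState + "/" + rm.storeState;
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));  // prints "ACTIVE/FENCED"
        System.out.println(simulate(false)); // prints "STANDBY/FENCED"
    }
}
```

In this model, the only difference between the two runs is whether the guard was consumed while the RM was already in STANDBY. Once it has been consumed, the later fencing event can no longer move the RM out of ACTIVE, which matches steps 10 and 11.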