Vinayakumar B created YARN-11839:
------------------------------------

             Summary: [RM HA] - In corner case, RM stay in ACTIVE with 
RMStateStore in FENCED state
                 Key: YARN-11839
                 URL: https://issues.apache.org/jira/browse/YARN-11839
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Vinayakumar B


In a corner case involving the following sequence of events, the RM stays in 
ACTIVE state while the RMStateStore is in FENCED state.
 # Initially, the RM is in ACTIVE state.
 # An event triggers `transitionToStandby()` on the RM.
 # During `reinitialize(true)` in the RM, the CapacityScheduler is created but 
not yet initialized.
 # Another `transitionToActive()` command, triggered from the admin CLI, calls 
`reinitialize()` on the CapacityScheduler, resulting in a 
`NullPointerException`, which in turn generates 
`RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED`.
 # This triggers the `StandByTransitionRunnable`, which sets the flag 
`hasAlreadyRun=true` even though the RM is already in STANDBY at this stage.
 # This state persists for some time.
 # Later, the RM becomes active after re-election, but 
`StandByTransitionRunnable#hasAlreadyRun` is still true.
 # Because ZK is unstable, the RMStateStore hits a ZK error and moves to 
*FENCED* state.
 # This again triggers the `StandByTransitionRunnable`.
 # Because the flag is already set, the `StandByTransitionRunnable` exits 
silently.
 # The RM stays in *ACTIVE* with the RMStateStore in *FENCED* state.
 # All new applications remain in *NEW_SAVING* state, and no further state 
changes occur for any application.
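
The stale-flag part of the sequence above (the runnable firing while the RM is 
already STANDBY, then being skipped after the store is fenced) can be sketched 
as a small standalone model. This is a simplified illustration, not the actual 
ResourceManager source: the class `StaleLatchDemo`, its enums, and the methods 
`runStandbyTransition()` and `simulate()` are hypothetical names, and the only 
detail taken from the report is a run-once runnable guarded by a 
`hasAlreadyRun` flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified, hypothetical model of the reported sequence.
// It is NOT the real ResourceManager code; all names here are illustrative.
class StaleLatchDemo {
    enum HaState { ACTIVE, STANDBY }
    enum StoreState { ACTIVE, FENCED }

    HaState haState = HaState.STANDBY;
    StoreState storeState = StoreState.ACTIVE;

    // Models the run-once guard described in the report.
    final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

    // Models the standby-transition runnable: it acts only on the first call.
    void runStandbyTransition() {
        if (hasAlreadyRun.getAndSet(true)) {
            return; // flag already set: exit silently
        }
        haState = HaState.STANDBY;
    }

    static String simulate() {
        StaleLatchDemo rm = new StaleLatchDemo();

        // transitionToActive fails while the RM is already STANDBY;
        // the runnable fires anyway and consumes the flag.
        rm.runStandbyTransition();

        // Later, the RM becomes ACTIVE after re-election; the flag is not reset.
        rm.haState = HaState.ACTIVE;

        // A ZK error fences the state store and fires the runnable again.
        rm.storeState = StoreState.FENCED;
        rm.runStandbyTransition(); // no-op because hasAlreadyRun is true

        return "haState=" + rm.haState + " storeState=" + rm.storeState;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints "haState=ACTIVE storeState=FENCED"
    }
}
```

In this model the second call is the one that should have moved the RM to 
STANDBY, but the flag consumed by the earlier call turns it into a no-op.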

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

