[ 
https://issues.apache.org/jira/browse/YARN-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11839.
----------------------------------
    Fix Version/s: 3.3.7
                   3.5.0
                   3.4.3
     Hadoop Flags: Reviewed
       Resolution: Fixed

> [RM HA] - In corner case, RM stay in ACTIVE with RMStateStore in FENCED state
> -----------------------------------------------------------------------------
>
>                 Key: YARN-11839
>                 URL: https://issues.apache.org/jira/browse/YARN-11839
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.3.6, 3.4.2
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.3.7, 3.5.0, 3.4.3
>
>
> In a corner case involving the following sequence of events, the RM stays 
> ACTIVE while the RMStateStore is in the FENCED state.
>  # Initially, the RM is in the ACTIVE state.
>  # An event triggers `transitionToStandby()` on the RM.
>  # During *reinitialize(true)* in the RM, the CapacityScheduler is created 
> but not yet initialized.
>  # Another `transitionToActive()` is triggered by ZK re-election, which 
> calls `reinitialize()` on the CapacityScheduler, resulting in a 
> `NullPointerException` and in turn generating 
> `RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED`.
>  # This triggers the `StandByTransitionRunnable`, which sets the flag 
> `hasAlreadyRun=true` even though the RM is already in STANDBY at this stage.
>  # This state persists for some time.
>  # After some time, the RM becomes active after re-election, but 
> `StandByTransitionRunnable#hasAlreadyRun` is still true.
>  # Now, because ZK is unstable, the RMStateStore hits a ZK error and goes to 
> the *FENCED* state.
>  # This again triggers the `StandByTransitionRunnable`.
>  # Because the flag is already set, `StandByTransitionRunnable` exits silently.
>  # The RM stays *ACTIVE* with the RMStateStore in the *FENCED* state.
>  # All new applications stay in the *NEW_SAVING* state, and no application 
> undergoes any further state change.
>  
>  
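
The sequence above hinges on a one-shot guard that is set while the RM is already in STANDBY and is still set when the RM later becomes active. The following is a minimal, self-contained sketch of that pattern, not the actual ResourceManager code. The names `StaleGuardSketch`, `ToyRm`, `replay`, and the `resetGuard` switch are hypothetical and exist only for illustration. Resetting the guard on activation is shown as one conceivable remedy and is not necessarily the change committed for this issue.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class StaleGuardSketch {
    enum HaState { ACTIVE, STANDBY }

    /** Toy stand-in for the RM: tracks only HA state, store fencing, and the guard. */
    static class ToyRm {
        HaState haState = HaState.ACTIVE;
        boolean storeFenced = false;
        // One-shot guard modelled on the hasAlreadyRun flag named in the report.
        final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

        /** Models the standby-transition runnable: a no-op once the guard is set. */
        void runStandbyTransition() {
            if (hasAlreadyRun.getAndSet(true)) {
                return; // silent exit, as in step 10 of the report
            }
            haState = HaState.STANDBY;
        }

        /** Models a successful activation; resetGuard is a hypothetical switch. */
        void transitionToActive(boolean resetGuard) {
            if (resetGuard) {
                hasAlreadyRun.set(false);
            }
            haState = HaState.ACTIVE;
            storeFenced = false;
        }

        /** Models the state store hitting a ZK error and requesting standby. */
        void fenceStore() {
            storeFenced = true;
            runStandbyTransition();
        }
    }

    /** Replays the reported sequence on the toy model and returns the final state. */
    static ToyRm replay(boolean resetGuard) {
        ToyRm rm = new ToyRm();
        rm.haState = HaState.STANDBY;      // steps 1-3: RM has already moved to STANDBY
        rm.runStandbyTransition();         // steps 4-5: failed activation sets the guard
        rm.transitionToActive(resetGuard); // step 7: RM becomes active after re-election
        rm.fenceStore();                   // steps 8-10: store is fenced, runnable fires again
        return rm;
    }

    public static void main(String[] args) {
        ToyRm stale = replay(false);
        System.out.println("guard not reset: " + stale.haState + ", fenced=" + stale.storeFenced);
        ToyRm reset = replay(true);
        System.out.println("guard reset:     " + reset.haState + ", fenced=" + reset.storeFenced);
    }
}
```

In this toy model, leaving the guard untouched ends with the RM ACTIVE and the store fenced, which matches step 11. Clearing the guard on activation lets the second standby request take effect.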



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
