[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:
----------------------------
    Fix Version/s: 2.9.2

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-9151
>                 URL: https://issues.apache.org/jira/browse/YARN-9151
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.2
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>              Labels: patch
>             Fix For: 3.1.1, 2.9.2
>
>         Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  The standby RM hangs forever (it neither retries nor crashes) after 
> permanently dropping out of the leader election.
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In practice, this happens when old/bad zk machines are replaced with new/good 
> zk machines, which changes their DNS hostnames.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318.
> The full RM log is in the attachment yarn_rm.zip.
> In summary, the sequence of events is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
>     Start RMActiveServices 
>     Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
>     Stop CommonNodeLabelsManager
>     Stop RMActiveServices
>     Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is swallowed and only a transition-to-standby event is sent)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is swallowed and only a transition-to-standby event is sent)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election and never retries or 
> crashes afterwards, so there are no further zk-related logs and the standby 
> RM hangs forever.)
> {noformat}
> So this is a bug in the RM: the *RM should always try to join the 
> election* (it should give up joining only when it decides to crash). 
> Otherwise, an RM that is outside the election can never become active again 
> and do real work.
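
The last step of the trace above (the second StandByTransitionRunnable returning immediately) can be modelled with a one-shot guard. The following is a simplified sketch, not the ResourceManager source: the class name `OneShotStandbyTransition`, the field `hasAlreadyRun`, and the `rejoinElection` callback are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model (assumption): a transition-to-standby task that is only
// allowed to do its work once per instance.
final class OneShotStandbyTransition implements Runnable {
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
  private final Runnable rejoinElection; // hypothetical callback

  OneShotStandbyTransition(Runnable rejoinElection) {
    this.rejoinElection = rejoinElection;
  }

  @Override
  public void run() {
    // Every call after the first returns here: no rejoin, no retry, no crash.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }
    rejoinElection.run();
  }
}

final class OneShotStandbyTransitionDemo {
  public static void main(String[] args) {
    AtomicInteger rejoinAttempts = new AtomicInteger();
    OneShotStandbyTransition transition =
        new OneShotStandbyTransition(rejoinAttempts::incrementAndGet);

    transition.run(); // first fatal event: one rejoin attempt (which may fail)
    transition.run(); // second fatal event: returns immediately

    System.out.println("rejoin attempts: " + rejoinAttempts.get());
  }
}
```

In this model, if the single rejoin attempt fails, nothing ever triggers another one, which matches the hang described in the trace.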
>  
> *Caused By:*
> The bug was introduced by YARN-3742.
> That JIRA intended that, when a STATE_STORE_OP_FAILED RMFatalEvent 
> happens, the RM should transition to standby instead of crashing.
>  *However, the change actually makes ALL kinds of RMFatalEvent ONLY transition 
> to standby instead of crashing.* (Before this change, the RM crashed on all 
> of them instead of transitioning to standby.)
>  So, even when EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the 
> RM is left in standby and not working, for example staying in standby forever.
> As the author of that JIRA said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>  
> *The Patch's Solution:*
> So, to be *conservative*, we should *only transition to standby for the 
> failures in the {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  All others, such as EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, and any 
> failure types added in the future, should crash the RM. Because we *cannot 
> ensure* that they will never leave the RM unable to work in standby state, the 
> *conservative* choice is to crash the RM. Besides, after a crash, the RM 
> watchdog can detect it and try to repair the RM machine, send alerts, etc.
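
The whitelist approach described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the enum mirrors the one quoted in the description, while `FatalEventPolicy`, `Action`, and `decide` are hypothetical names introduced only for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Mirrors the enum quoted in the issue description.
enum RMFatalEventType {
  STATE_STORE_FENCED,
  STATE_STORE_OP_FAILED,
  EMBEDDED_ELECTOR_FAILED,
  TRANSITION_TO_ACTIVE_FAILED,
  CRITICAL_THREAD_CRASH
}

// Hypothetical helper (assumption): maps a fatal event type to a reaction.
final class FatalEventPolicy {
  enum Action { TRANSITION_TO_STANDBY, SHUTDOWN }

  // Only these failure types are considered safe to handle by going to standby.
  private static final Set<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
      RMFatalEventType.STATE_STORE_FENCED,
      RMFatalEventType.STATE_STORE_OP_FAILED,
      RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

  private FatalEventPolicy() {
  }

  static Action decide(RMFatalEventType type) {
    // Anything not whitelisted, including types added later, crashes the RM.
    return STANDBY_WHITELIST.contains(type)
        ? Action.TRANSITION_TO_STANDBY
        : Action.SHUTDOWN;
  }
}

final class FatalEventPolicyDemo {
  public static void main(String[] args) {
    for (RMFatalEventType type : RMFatalEventType.values()) {
      System.out.println(type + " -> " + FatalEventPolicy.decide(type));
    }
  }
}
```

Defaulting to shutdown means that a newly added failure type crashes the RM unless someone deliberately adds it to the whitelist, which is the conservative behaviour the description argues for.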



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
