[jira] [Created] (YARN-11622) ResourceManager asynchronous switch to Standby/Active exception

wangzhihui (Jira) Sun, 03 Dec 2023 03:39:05 -0800

wangzhihui created YARN-11622:
---------------------------------

             Summary: ResourceManager asynchronous switch to Standby/Active exception
                 Key: YARN-11622
                 URL: https://issues.apache.org/jira/browse/YARN-11622
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.0.0
            Reporter: wangzhihui
         Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
yuque_diagram.jpg


h1. Two exception cases
h2. The first case

*Exception description:*
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
        at java.lang.Thread.run(Thread.java:748)

 * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
 * As shown in the following figure, Thread_1 resets activeServices to null during the toStandby process. At this point, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager server to exit.

 !yuque_diagram.jpg|width=629,height=100!
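
The interleaving described above can be replayed sequentially in a small sketch. This is only an illustration under stated assumptions: ActiveServicesRaceSketch, ActiveServices, clearDuringStandby and handleFatalEvent are hypothetical stand-ins, not the actual ResourceManager code, and the two threads are replaced by two ordered calls so that the outcome is deterministic.

```java
public class ActiveServicesRaceSketch {

    // Hypothetical stand-in for RMActiveServices; not the real class.
    static final class ActiveServices {
        String name() {
            return "activeServices";
        }
    }

    private volatile ActiveServices activeServices = new ActiveServices();

    // Step taken by Thread_1 partway through its transition to standby.
    void clearDuringStandby() {
        activeServices = null;
    }

    // Step taken by Thread_2: dereferences the field without a null check.
    String handleFatalEvent() {
        return activeServices.name();
    }

    // Replays the two steps in the problematic order on a single thread.
    public static boolean replayThrowsNpe() {
        ActiveServicesRaceSketch rm = new ActiveServicesRaceSketch();
        rm.clearDuringStandby();
        try {
            rm.handleFatalEvent();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + replayThrowsNpe());
    }
}
```

In the real service the two steps run on different threads, so the failure depends on timing; the sketch only fixes the order to show why the dereference fails.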

h2. The second case

*Exception description:*
06:17:35,913 WARN  ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
        at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
        at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
        ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
        ... 5 more
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
        ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed

 * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35. The two asynchronous events are referred to as Thread_1 and Thread_2.
 * While Thread_1 is executing, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At this time, the csConfProvider property of the CapacityScheduler has not been initialized and its value is null. As a result, reinitialize dereferences csConfProvider, which triggers a NullPointerException and causes Thread_1's transition to active to fail.

 !yuque_diagram (1).jpg|width=568,height=155!

h1. Solution

The locks in ResourceManager's transitionToActive and transitionToStandby 
methods cover only a limited scope, so events triggered asynchronously outside 
that scope can interfere with each other and cause unpredictable failures. The 
proposed solution is to encapsulate each asynchronous task as a 
TransitionToActiveStandbyRunner and enqueue it in a queue that a 
SingleThreadExecutor executes in order. This approach resolves the asynchronous 
problem and makes the transitions to active and standby clearer and more 
controllable.

!rm_ha_solution.png|width=362,height=353!
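
A minimal sketch of the queueing idea, assuming a plain java.util.concurrent single-thread executor. SerializedTransitionSketch, HaState, submit and the source labels "elector" and "stateStore" are hypothetical names chosen for illustration; they are not part of the proposed patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerializedTransitionSketch {

    // Hypothetical stand-in for the RM's HA state.
    enum HaState { STANDBY, ACTIVE }

    // One worker thread: queued transitions run strictly one after another.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final List<String> log = Collections.synchronizedList(new ArrayList<>());
    private HaState state = HaState.STANDBY; // only touched on the worker thread

    // Enqueue a transition request; it starts only after earlier ones finish.
    Future<HaState> submit(String source, HaState target) {
        Callable<HaState> task = () -> {
            log.add(source + ":start");
            state = target;
            log.add(source + ":end");
            return state;
        };
        return executor.submit(task);
    }

    List<String> log() {
        return new ArrayList<>(log);
    }

    void shutdown() {
        executor.shutdown();
    }

    // Two requests from different sources, submitted back to back.
    public static List<String> demo() throws Exception {
        SerializedTransitionSketch rm = new SerializedTransitionSketch();
        Future<HaState> first = rm.submit("elector", HaState.ACTIVE);
        Future<HaState> second = rm.submit("stateStore", HaState.STANDBY);
        first.get();
        second.get();
        rm.shutdown();
        return rm.log();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Because a single worker drains the queue, the start and end entries of one transition can never interleave with those of another, which is the property the two exception cases above are missing.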


h2. TransitionToActiveStandbyRunner and Subclasses
h3. TransitionToActiveStandbyRunner
 * TransitionToActiveStandbyRunner is a template class where the logic for 
different scenarios is placed and executed within the doTransaction method.

import java.util.concurrent.Callable;

public abstract class TransitionToActiveStandbyRunner
        implements Callable<TransitionToActiveStandbyResult> {

    @Override
    public TransitionToActiveStandbyResult call() throws Exception {
        // ... log before the transition ...
        TransitionToActiveStandbyResult result = doTransaction();
        // ... log after the transition ...
        return result;
    }

    // Each subclass implements the transition logic for its scenario.
    public abstract TransitionToActiveStandbyResult doTransaction();

}
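
To make the template pattern concrete, here is a self-contained sketch with one hypothetical subclass. The fields of TransitionToActiveStandbyResult and the ExampleToStandbyRunner class are assumptions made for illustration; this issue does not show the result type's contents, and the real subclasses are the ones listed under Subclasses.

```java
import java.util.concurrent.Callable;

public class RunnerTemplateSketch {

    // Hypothetical result type; the fields are assumed for this sketch.
    static final class TransitionToActiveStandbyResult {
        final boolean success;
        final String message;

        TransitionToActiveStandbyResult(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    // Template class: call() wraps the scenario-specific doTransaction().
    abstract static class TransitionToActiveStandbyRunner
            implements Callable<TransitionToActiveStandbyResult> {

        @Override
        public TransitionToActiveStandbyResult call() throws Exception {
            System.out.println("before: " + getClass().getSimpleName());
            TransitionToActiveStandbyResult result = doTransaction();
            System.out.println("after: success=" + result.success);
            return result;
        }

        public abstract TransitionToActiveStandbyResult doTransaction();
    }

    // Hypothetical subclass; a real one would wrap the actual transition logic.
    static final class ExampleToStandbyRunner extends TransitionToActiveStandbyRunner {
        @Override
        public TransitionToActiveStandbyResult doTransaction() {
            return new TransitionToActiveStandbyResult(true, "standby");
        }
    }

    public static String demo() throws Exception {
        return new ExampleToStandbyRunner().call().message;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Since the runner is a Callable, an instance can be handed directly to an ExecutorService, which is what allows the transitions to be queued.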

h3. Subclasses

*AdminServiceToActiveRunner*

AdminServiceToActiveRunner encapsulates the logic of the transitionToActive 
method in AdminService, handling the requests from clients and 
ActiveStandbyElector to transition to the active state.



*AdminServiceToStandbyRunner*

AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby 
method in AdminService, handling the requests from clients and 
ActiveStandbyElector to transition to the standby state.



*RmStartAndStopToStandby*

RmStartAndStopToStandby is used for transitioning the ResourceManager service 
to standby when it is starting or stopping.

*RMStartToActiveRunner*

RMStartToActiveRunner is used for transitioning the ResourceManager service to 
active when it is starting.



*RMFatalToStandbyRunner*

RMFatalToStandbyRunner is used to handle an RMFatalEvent by transitioning to 
standby when YARN HA mode is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

