wangzhihui created YARN-11622:
---------------------------------
Summary: ResourceManager asynchronous switch to Standby/Active exception
Key: YARN-11622
URL: https://issues.apache.org/jira/browse/YARN-11622
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.0.0
Reporter: wangzhihui
Attachments: rm_ha_solution.png, yuque_diagram (1).jpg,
yuque_diagram.jpg
h1. Two exception cases:
h2. The first case:
*Exception description:*
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)
* ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at
14:52:57. The two asynchronous events are referred to below as Thread_1 and
Thread_2.
* As shown in the following figure, Thread_1 sets activeServices to null while
reinitializing it during the toStandby process. At this point, Thread_2 uses
activeServices while executing the handleTransitionToStandByInNewThread method,
which results in a NullPointerException and causes the ResourceManager to
exit.
!yuque_diagram.jpg|width=629,height=100!
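The interleaving in the figure can be reproduced in miniature. The following is only a sketch under stated assumptions: ActiveServicesRaceSketch, its nested ActiveServices class and the transitionTask member are hypothetical stand-ins rather than YARN code, and the latches exist solely to force the ordering that occurs by chance in production.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical miniature of the first case. None of these names are YARN code.
class ActiveServicesRaceSketch {
  // Stand-in for RMActiveServices; transitionTask is an illustrative member.
  static class ActiveServices {
    final Runnable transitionTask = () -> { };
  }

  private volatile ActiveServices activeServices = new ActiveServices();

  // Returns true if Thread_2 hit a NullPointerException inside Thread_1's window.
  static boolean reproduce() {
    ActiveServicesRaceSketch rm = new ActiveServicesRaceSketch();
    CountDownLatch nulled = new CountDownLatch(1);
    CountDownLatch dereferenced = new CountDownLatch(1);
    boolean[] sawNpe = {false};

    Thread t1 = new Thread(() -> {
      rm.activeServices = null;                 // reinitialize: old services dropped
      nulled.countDown();
      await(dereferenced);                      // window in which Thread_2 runs
      rm.activeServices = new ActiveServices(); // services recreated too late
    }, "Thread_1");

    Thread t2 = new Thread(() -> {
      await(nulled);
      try {
        rm.activeServices.transitionTask.run(); // dereferences the null field
      } catch (NullPointerException e) {
        sawNpe[0] = true;
      } finally {
        dereferenced.countDown();
      }
    }, "Thread_2");

    t1.start();
    t2.start();
    join(t1);
    join(t2);
    return sawNpe[0];
  }

  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static void join(Thread thread) {
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("NullPointerException observed: " + reproduce());
  }
}
```

In the sketch the exception is caught so the outcome can be reported; in the trace above it escapes to the dispatcher thread.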
h2. The second case:
*Exception description:*
06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
... 5 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
* ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a
toStandby event at 06:17:35. The two asynchronous events are referred to below
as Thread_1 and Thread_2.
* During the execution of Thread_1, CapacityScheduler.reinitialize is called to
refresh the scheduler configuration. At this time, the csConfProvider property
of the CapacityScheduler has not been initialized and its value is null. As a
result, reinitialize dereferences csConfProvider, triggering a
NullPointerException and causing Thread_1's transition to active to fail.
!yuque_diagram (1).jpg|width=568,height=155!
h1. Solution
The lock in ResourceManager's transitionToActive and transitionToStandby
methods covers only a limited scope, so different events triggered
asynchronously outside that scope can interfere with each other, leading to
unpredictable issues. The proposed solution is to encapsulate each
asynchronous task as a TransitionToActiveStandbyRunner and enqueue it in a
queue that a SingleThreadExecutor executes in order. This approach resolves
the concurrency problem and makes the switching between active and standby
clearer and more controllable.
!rm_ha_solution.png|width=362,height=353!
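The queueing idea can be sketched with the standard java.util.concurrent API. This is a minimal illustration under stated assumptions, not the patch itself: SerializedTransitionSketch, HaState, and the "elector" and "stateStore" labels are hypothetical names, and the only property shown is that tasks submitted to a single-thread executor run one at a time in submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the ResourceManager HA state. Not YARN code.
class SerializedTransitionSketch {
  enum HaState { STANDBY, ACTIVE }

  // Both fields are touched only by the single executor thread.
  private HaState state = HaState.STANDBY;
  private final List<String> log = new ArrayList<>();

  // One worker thread backed by a FIFO queue: transitions never overlap.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  Future<HaState> submit(String source, HaState target) {
    Callable<HaState> task = () -> {
      log.add(source + ":" + state + "->" + target);
      state = target;
      return state;
    };
    return executor.submit(task);
  }

  // Submits two competing transitions and returns the order in which they ran.
  static List<String> demo() {
    SerializedTransitionSketch rm = new SerializedTransitionSketch();
    try {
      Future<HaState> first = rm.submit("elector", HaState.ACTIVE);
      Future<HaState> second = rm.submit("stateStore", HaState.STANDBY);
      first.get();
      second.get(); // after get() returns, the log writes are visible here
      return rm.log;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      rm.executor.shutdown();
    }
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

Because each task observes the state left by the previous one, the second transition starts from ACTIVE rather than from a half-initialized intermediate state.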
h2. TransitionToActiveStandbyRunner and Subclasses
h3. TransitionToActiveStandbyRunner
* TransitionToActiveStandbyRunner is a template class where the logic for
different scenarios is placed and executed within the doTransaction method.
public abstract class TransitionToActiveStandbyRunner
    implements Callable<TransitionToActiveStandbyResult> {
  @Override
  public TransitionToActiveStandbyResult call() throws Exception {
    // ... before log ...
    TransitionToActiveStandbyResult result = doTransaction();
    // ... after log ...
    return result;
  }

  // Each subclass implements the transition logic for its own scenario.
  public abstract TransitionToActiveStandbyResult doTransaction();
}
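To make the template concrete, here is one way a subclass could plug into it. This is a sketch under assumptions: the fields of TransitionToActiveStandbyResult are not shown in this issue, so the success and message fields are invented for illustration, as are ToStandbyRunner and FailingRunner. Converting a runtime exception into a failed result is also an assumed design choice, not something the issue specifies.

```java
import java.util.concurrent.Callable;

// Hypothetical, self-contained variant of the template class. Not YARN code.
class TemplateRunnerSketch {
  // Assumed result shape: the issue does not show the real fields.
  static final class TransitionToActiveStandbyResult {
    final boolean success;
    final String message;

    TransitionToActiveStandbyResult(boolean success, String message) {
      this.success = success;
      this.message = message;
    }
  }

  abstract static class TransitionToActiveStandbyRunner
      implements Callable<TransitionToActiveStandbyResult> {
    @Override
    public final TransitionToActiveStandbyResult call() {
      String name = getClass().getSimpleName();
      System.out.println("before " + name);
      TransitionToActiveStandbyResult result;
      try {
        result = doTransaction();
      } catch (RuntimeException e) {
        // Assumed policy: report the failure instead of propagating it.
        result = new TransitionToActiveStandbyResult(false, e.toString());
      }
      System.out.println("after " + name + " success=" + result.success);
      return result;
    }

    abstract TransitionToActiveStandbyResult doTransaction();
  }

  // Illustrative subclass: the real transition logic would go here.
  static final class ToStandbyRunner extends TransitionToActiveStandbyRunner {
    @Override
    TransitionToActiveStandbyResult doTransaction() {
      return new TransitionToActiveStandbyResult(true, "standby");
    }
  }

  // Illustrative subclass whose transition fails.
  static final class FailingRunner extends TransitionToActiveStandbyRunner {
    @Override
    TransitionToActiveStandbyResult doTransaction() {
      throw new IllegalStateException("refresh failed");
    }
  }

  public static void main(String[] args) {
    new ToStandbyRunner().call();
    new FailingRunner().call();
  }
}
```

Keeping the logging in call() and the scenario logic in doTransaction() means every transition is logged the same way regardless of which subclass runs.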
h3. Subclasses
*AdminServiceToActiveRunner*
AdminServiceToActiveRunner encapsulates the logic of the transitionToActive
method in AdminService, handling the requests from clients and
ActiveStandbyElector to transition to the active state.
*AdminServiceToStandbyRunner*
AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby
method in AdminService, handling the requests from clients and
ActiveStandbyElector to transition to the standby state.
*RmStartAndStopToStandby*
RmStartAndStopToStandby is used for transitioning the ResourceManager service
to standby when it is stopping or starting.
*RMStartToActiveRunner*
RMStartToActiveRunner is used for transitioning the ResourceManager service to
active when it is starting.
*RMFatalToStandbyRunner*
RMFatalToStandbyRunner is used to handle an RMFatalEvent by transitioning to
standby when YARN HA mode is enabled.