[ https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tarun Parimi updated YARN-9712:
-------------------------------
    Attachment: YARN-9712.001.patch

> ResourceManager goes into a deadlock while transitioning to standby
> -------------------------------------------------------------------
>
>                 Key: YARN-9712
>                 URL: https://issues.apache.org/jira/browse/YARN-9712
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, RM
>    Affects Versions: 2.9.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>         Attachments: YARN-9712.001.patch
>
>
> We have observed the RM go into a deadlock while transitioning to standby in 
> a heavily loaded production cluster that experiences random losses of its 
> ZooKeeper session and also serves a large number of RMDelegationToken 
> requests from Oozie jobs.
> On analyzing the jstack and the logs, this appears to happen when the below 
> sequence of events occurs.
> 1. The ZooKeeper session is lost, so the ActiveStandbyElector service 
> initiates transitionToStandby. Since transitionToStandby is a synchronized 
> method, it acquires the monitor lock on the ResourceManager instance.
> {code:java}
> 2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. Entering neutral mode and rejoining...
> 2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby state
> {code}
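> For reference, a simplified paraphrase of the method (not the exact source; 
> the body varies by Hadoop version), showing why the caller holds the 
> ResourceManager monitor for the entire transition:
> {code:java}
> // Simplified paraphrase of ResourceManager.transitionToStandby, not the
> // exact source: "synchronized" means the ZooKeeper event thread owns the
> // ResourceManager monitor until the whole transition completes.
> synchronized void transitionToStandby(boolean initialize) throws Exception {
>   LOG.info("Transitioning to standby state");
>   stopActiveServices();       // stops RMSecretManagerService (see step 2)
>   if (initialize) {
>     reinitialize(initialize); // resetRMContext() stops and joins the
>                               // rmDispatcher event thread (see step 5)
>   }
> }
> {code}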
> 2. While transitioning to standby, a java.lang.InterruptedException occurs 
> in RMStateStore while removing/storing an RMDelegationToken, because 
> RMSecretManagerService is stopped as part of the transition to standby.
> {code:java}
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken and SequenceNumber
> java.lang.InterruptedException
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store operation failed
> java.lang.InterruptedException
> {code}
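> The InterruptedException itself is just the normal effect of stopping a 
> service whose thread is blocked in a store operation. A minimal standalone 
> illustration (hypothetical code, not RM source):
> {code:java}
> // Minimal standalone illustration (hypothetical, not RM code): stopping a
> // service interrupts its worker thread, and a blocking store call then
> // surfaces the stop as an InterruptedException, as in the log above.
> public class InterruptOnStop {
>   public static void main(String[] args) throws Exception {
>     Thread storeWorker = new Thread(() -> {
>       try {
>         Thread.sleep(60_000L); // stands in for a blocking token store/remove
>       } catch (InterruptedException e) {
>         System.err.println("State store operation failed: " + e);
>       }
>     });
>     storeWorker.start();
>     storeWorker.interrupt(); // what serviceStop() effectively does
>     storeWorker.join();
>   }
> }
> {code}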
> 3. When the state store error occurs, an RMFatalEvent of type 
> STATE_STORE_FENCED is sent.
> {code:java}
> 2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(767)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by java.lang.InterruptedException
> {code}
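> Roughly (paraphrased, not the exact source), the state store reports the 
> failure by dispatching the fatal event on the shared rmDispatcher, which is 
> what hands it to the RMFatalEventDispatcher in step 4:
> {code:java}
> // Paraphrased sketch (not exact source) of how the failure is reported:
> // the state store forwards it to the shared rmDispatcher as an RMFatalEvent.
> rmDispatcher.getEventHandler().handle(
>     new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED, failureCause));
> {code}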
> 4. The problem occurs when RMFatalEventDispatcher calls getConfig(), which 
> also needs the lock on the ResourceManager instance since it is a 
> synchronized method. This causes the rmDispatcher eventHandlingThread to 
> become blocked.
> {code:java}
> private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
>     @Override
>     public void handle(RMFatalEvent event) {
>       LOG.error("Received " + event);
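>       // getConfig() is AbstractService#getConfig(), a synchronized method;
>       // while transitionToStandby holds the ResourceManager monitor (step
>       // 1), the dispatcher thread blocks right here (see the jstack below).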
>       if (HAUtil.isHAEnabled(getConfig())) {
>         // If we're in an HA config, the right answer is always to go into
>         // standby.
>         LOG.warn("Transitioning the resource manager to standby.");
>         handleTransitionToStandByInNewThread();
> {code}
> 5. transitionToStandby then waits forever in AsyncDispatcher.serviceStop, 
> joining the eventHandlingThread of the rmDispatcher, which is itself blocked 
> waiting for the ResourceManager lock. This is a deadlock, and the RM will 
> not become active until it is restarted.
> Below are the relevant threads from the captured jstack.
> The transitionToStandby thread, which waits forever:
> {code:java}
> "main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x00007fea473b2800 nid=0x2f411 in Object.wait() [0x00007fda5bef5000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1245)
>         - locked <0x00007fdb6c5059a0> (a java.lang.Thread)
>         at java.lang.Thread.join(Thread.java:1319)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00007fdb6c538ca0> (a java.lang.Object)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
>         - locked <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:355)
>         - locked <0x00007fdb33e41828> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
>         at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
>         - locked <0x00007fdb33e7bb88> (a org.apache.hadoop.ha.ActiveStandbyElector)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>    Locked ownable synchronizers:
>         - None
> {code}
> The blocked rmDispatcher event handling thread:
> {code:java}
> "AsyncDispatcher event handler" #135565 daemon prio=5 os_prio=0 tid=0x00007fdb2107f000 nid=0x2484a waiting for monitor entry [0x00007fda597cc000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.service.AbstractService.getConfig(AbstractService.java:403)
>         - waiting to lock <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:769)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:764)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>         - None
> {code}
> This scenario happens only with the changes introduced in YARN-3742, where 
> RMFatalEventDispatcher handles ERROR scenarios such as STATE_STORE_FENCED 
> and tries to transitionToStandby.
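> For illustration, the whole cycle can be reproduced with a minimal 
> standalone sketch (hypothetical class, not RM code): a synchronized method 
> joins a thread that is itself blocked waiting for the same monitor, so the 
> join can never return.
> {code:java}
> // Hypothetical minimal reproduction of the same deadlock pattern.
> // transitionToStandby() plays the "main-EventThread" role: it holds the
> // object monitor and joins the dispatcher. getConfig() plays the
> // "AsyncDispatcher event handler" role: it blocks on the same monitor.
> // The program hangs by design.
> public class RMDeadlockSketch {
>   private final Thread dispatcher = new Thread(() -> getConfig());
>
>   private synchronized void getConfig() { } // needs the monitor held below
>
>   public synchronized void transitionToStandby() throws InterruptedException {
>     dispatcher.start();
>     dispatcher.join(); // like AsyncDispatcher.serviceStop(): waits forever
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     new RMDeadlockSketch().transitionToStandby(); // never returns
>   }
> }
> {code}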


