[ https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895033#comment-16895033 ]
Tarun Parimi edited comment on YARN-9712 at 7/29/19 8:05 AM:
-------------------------------------------------------------

{quote}2. While transitioning to standby, a java.lang.InterruptedException occurs in RMStateStore while removing/storing RMDelegationToken. This is because RMSecretManagerService will be stopped while transitioning to standby.{quote}

Looks like this scenario can be prevented with the fix in YARN-6647 from version 3.0.0 onwards.

was (Author: tarunparimi):
bq. 2. While transitioning to standby, a java.lang.InterruptedException occurs in RMStateStore while removing/storing RMDelegationToken. This is because RMSecretManagerService will be stopped while transitioning to standby.

Looks like this scenario can be prevented with the fix in YARN-6647.

> ResourceManager goes into a deadlock while transitioning to standby
> -------------------------------------------------------------------
>
>             Key: YARN-9712
>             URL: https://issues.apache.org/jira/browse/YARN-9712
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: resourcemanager, RM
> Affects Versions: 2.9.0
>        Reporter: Tarun Parimi
>        Priority: Major
>
> We have observed RM go into a deadlock while transitioning to standby in a
> heavily loaded production cluster which experiences random connection loss to
> a zookeeper session and also has a large number of RMDelegationToken requests
> due to oozie jobs.
> On analyzing the jstack and the logs, this seems to happen when the below
> sequence of events occurs:
> 1. Zookeeper session is lost and so the ActiveStandbyElector service will do
> transitionToStandby. This transitionToStandby is a synchronized method and
> so will acquire a lock on ResourceManager.
> {code:java}
> 2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired.
> Entering neutral mode and rejoining...
> 2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager
> (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby
> state
> {code}
> 2. While transitioning to standby, a java.lang.InterruptedException occurs in
> RMStateStore while removing/storing RMDelegationToken. This is because
> RMSecretManagerService will be stopped while transitioning to standby.
> {code:java}
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore
> (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken
> and SequenceNumber
> java.lang.InterruptedException
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore
> (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store
> operation failed
> java.lang.InterruptedException
> {code}
> 3. When a state store error occurs, an RMFatalEvent of type STATE_STORE_FENCED
> will be sent.
> {code:java}
> 2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager
> (ResourceManager.java:handle(767)) - Received RMFatalEvent of type
> STATE_STORE_FENCED, caused by java.lang.InterruptedException
> {code}
> 4. The problem occurs when the RMFatalEventDispatcher calls getConfig().
> This also needs a lock on ResourceManager since it's a synchronized method.
> This will cause the rmDispatcher eventHandlingThread to become blocked.
> {code:java}
> private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
>   @Override
>   public void handle(RMFatalEvent event) {
>     LOG.error("Received " + event);
>     if (HAUtil.isHAEnabled(getConfig())) {
>       // If we're in an HA config, the right answer is always to go into
>       // standby.
>       LOG.warn("Transitioning the resource manager to standby.");
>       handleTransitionToStandByInNewThread();
> {code}
> 5. The transitionToStandby will wait forever as the eventHandlingThread of
> rmDispatcher is blocked. This causes a deadlock and RM will not become active
> until restarted.
> Below are the relevant threads in the jstack captured.
> The transitionToStandby thread that waits forever.
> {code:java}
> "main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x00007fea473b2800 nid=0x2f411 in Object.wait() [0x00007fda5bef5000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1245)
>         - locked <0x00007fdb6c5059a0> (a java.lang.Thread)
>         at java.lang.Thread.join(Thread.java:1319)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00007fdb6c538ca0> (a java.lang.Object)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
>         - locked <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:355)
>         - locked <0x00007fdb33e41828> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
>         at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
>         - locked <0x00007fdb33e7bb88> (a org.apache.hadoop.ha.ActiveStandbyElector)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>    Locked ownable synchronizers:
>         - None
> {code}
> The blocked rmDispatcher EventHandler.
> {code:java}
> "AsyncDispatcher event handler" #135565 daemon prio=5 os_prio=0 tid=0x00007fdb2107f000 nid=0x2484a waiting for monitor entry [0x00007fda597cc000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.service.AbstractService.getConfig(AbstractService.java:403)
>         - waiting to lock <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:769)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:764)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>         - None
> {code}
> This scenario will happen only with the changes introduced in
> YARN-3742, where RMFatalEventDispatcher handles ERROR scenarios such as
> STATE_STORE_FENCED and tries to transitionToStandby.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
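The lock cycle in the two jstacks above (transitionToStandby holds the ResourceManager monitor while joining the dispatcher thread, and the dispatcher thread is blocked entering getConfig() on that same monitor) can be reproduced with a minimal, self-contained Java sketch. MiniRM and its methods are simplified, hypothetical stand-ins for ResourceManager/AbstractService, not the actual YARN code:

```java
// Minimal sketch of the YARN-9712 lock cycle, using hypothetical names.
class MiniRM {
    private Thread eventThread;

    // Stand-in for the synchronized AbstractService.getConfig() that the
    // RMFatalEventDispatcher calls on the dispatcher thread.
    synchronized void getConfig() { }

    // Stand-in for the synchronized transitionToStandby(). The real code
    // stops the AsyncDispatcher with an untimed Thread.join() while still
    // holding this monitor, so it waits forever; here we merely poll until
    // the dispatcher thread is provably stuck, so the demo terminates.
    synchronized Thread.State transitionToStandby() throws InterruptedException {
        eventThread = new Thread(
            this::getConfig,                  // blocks: we hold the monitor
            "AsyncDispatcher event handler");
        eventThread.start();
        while (eventThread.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10);                 // wait for the cycle to form
        }
        return eventThread.getState();        // BLOCKED on our own monitor
    }

    public static void main(String[] args) throws InterruptedException {
        MiniRM rm = new MiniRM();
        // With an untimed join in place of the poll, this call would hang
        // exactly like the "main-EventThread" in the jstack above.
        System.out.println(rm.transitionToStandby()); // prints BLOCKED
    }
}
```

This also illustrates why YARN-3742 is a precondition: without an event handler that re-enters a synchronized method on the same object, the dispatcher thread drains its queue and the join completes normally.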