[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241114#comment-14241114 ]
Rohith commented on YARN-2946:
------------------------------

The extracted jstack output below shows the detected deadlock:
{noformat}
Found one Java-level deadlock:
=============================
"org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread":
  waiting to lock monitor 0x0000000000a39138 (object 0x00000000c0234980, a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine),
  which is held by "AsyncDispatcher event handler"
"AsyncDispatcher event handler":
  waiting to lock monitor 0x0000000000a391e8 (object 0x00000000c02347e0, a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore),
  which is held by "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread"

Java stack information for the threads listed above:
===================================================
"org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread":
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    - waiting to lock <0x00000000c0234980> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateFencedState(RMStateStore.java:449)
    - locked <0x00000000c02347e0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.notifyStoreOperationFailed(RMStateStore.java:713)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1030)
"AsyncDispatcher event handler":
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000c02347e0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1043)
    - locked <0x00000000c02347e0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1070)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:975)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:667)
    - locked <0x00000000c02347e0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:246)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:231)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    - locked <0x00000000c0234980> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:699)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:754)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:749)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.
{noformat}

> Deadlock in ZKRMStateStore
> --------------------------
>
> Key: YARN-2946
> URL: https://issues.apache.org/jira/browse/YARN-2946
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Rohith
> Assignee: Rohith
>
> Found one deadlock in ZKRMStateStore.
> # Initially, zkClient is null because of a ZooKeeper Disconnected event.
> # While ZKRMStateStore#runWithCheck() is in wait(zkSessionTimeout), waiting for
> zkClient to re-establish the ZooKeeper connection via either a SyncConnected or
> an Expired event, it is highly possible that another thread obtains the lock on
> {{ZKRMStateStore.this}} from state machine transition events. This causes a
> deadlock in ZKRMStateStore.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
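The lock ordering in the thread dump and the issue description can be reproduced in isolation. The following is a minimal sketch, not the YARN code: the class, thread, and variable names are hypothetical, and two plain objects stand in for the InternalStateMachine and ZKRMStateStore monitors. It relies on the fact that Object.wait() releases only the monitor it is called on, so the outer monitor stays held while another thread takes the inner one.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class WaitLockInversionDemo {

    /** Returns true once the JVM reports a monitor deadlock, false on timeout. */
    public static boolean reproduceDeadlock() {
        // Hypothetical stand-ins for the two monitors named in the thread dump.
        final Object stateMachine = new Object();   // InternalStateMachine
        final Object store = new Object();          // ZKRMStateStore.this
        final boolean[] verifierHoldsStore = {false}; // guarded by "store"
        final CountDownLatch dispatcherHoldsBoth = new CountDownLatch(1);

        // Mirrors "AsyncDispatcher event handler": state machine first, then store.
        Thread dispatcher = new Thread(() -> {
            synchronized (stateMachine) {
                synchronized (store) {
                    dispatcherHoldsBoth.countDown();
                    try {
                        while (!verifierHoldsStore[0]) {
                            // Releases "store" only; "stateMachine" stays locked.
                            // Returning from wait() requires re-acquiring "store".
                            store.wait(50);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }, "dispatcher");

        // Mirrors VerifyActiveStatusThread: store first, then state machine.
        Thread verifier = new Thread(() -> {
            try {
                dispatcherHoldsBoth.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (store) {              // only possible while dispatcher is in wait()
                verifierHoldsStore[0] = true;
                synchronized (stateMachine) {   // blocks: dispatcher still holds it
                }
            }
        }, "verifier");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        dispatcher.setDaemon(true);
        verifier.setDaemon(true);
        dispatcher.start();
        verifier.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {     // poll for up to roughly 5 seconds
                if (threads.findMonitorDeadlockedThreads() != null) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceDeadlock());
    }
}
```

In this sketch the cycle disappears if both threads take the two monitors in the same order, or if the second thread does not call into the state machine while holding the store monitor. Whether either approach suits ZKRMStateStore is a separate question that this example does not answer.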