[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018159#comment-15018159 ]
Tsuyoshi Ozawa commented on YARN-4348:
--------------------------------------

Found that this is caused by the lock ordering.

1. (In the main thread of RM) ZKRMStateStore is locked (startInternal) -> waiting on lock.await()
2. ZK's event thread: got a SyncConnected event from ZK -> calls ForwardingWatcher#process -> processWatchEvent is called, but ZKRMStateStore has been locked since step 1
3. (In the main thread of RM) timeout and IOException -> ZKRMStateStore is unlocked -> the sync callback, processEvent, is fired.

I will attach a patch to address this problem.

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of
> zkSessionTimeout
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4348
>                 URL: https://issues.apache.org/jira/browse/YARN-4348
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2, 2.6.2
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Tsuyoshi Ozawa
>            Priority: Blocker
>         Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch,
> YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore
> can cause the following situation:
> 1. syncInternal times out,
> 2. but sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
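
The lock ordering described in the comment can be reproduced in isolation. The following is a minimal sketch, not the ZKRMStateStore code: the class LockOrderingSketch, the storeLock monitor, and the CountDownLatch standing in for lock.await() are all hypothetical names chosen for illustration. It shows why a wait performed while holding the store's monitor always times out when the callback that would signal it needs the same monitor, and why the callback then fires as soon as the monitor is released. The waitWithoutLock variant is only one possible way to avoid the ordering problem and is not necessarily what the attached patch does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the situation in the comment; not Hadoop code.
class LockOrderingSketch {
    // Plays the role of the ZKRMStateStore monitor.
    private final Object storeLock = new Object();
    // Plays the role of the latch the main thread waits on.
    private final CountDownLatch connected = new CountDownLatch(1);

    // Models the event-thread callback: it must take the store lock
    // before it can signal the waiting thread.
    private void processWatchEvent() {
        synchronized (storeLock) {
            connected.countDown();
        }
    }

    // Steps 1-3 of the comment: wait for the event while holding the lock.
    // The event thread stays blocked on storeLock, so the wait times out.
    public boolean waitHoldingLock(long timeoutMs) throws InterruptedException {
        Thread eventThread = new Thread(this::processWatchEvent, "event-thread");
        boolean signalled;
        synchronized (storeLock) {
            eventThread.start();
            signalled = connected.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
        // The lock is free now, so the callback can finally run.
        eventThread.join();
        return signalled;
    }

    // True once the callback has run (possibly after the wait gave up).
    public boolean callbackFired() {
        return connected.getCount() == 0;
    }

    // Assumed alternative: wait without holding the lock the callback needs.
    public boolean waitWithoutLock(long timeoutMs) throws InterruptedException {
        Thread eventThread = new Thread(this::processWatchEvent, "event-thread");
        eventThread.start();
        boolean signalled = connected.await(timeoutMs, TimeUnit.MILLISECONDS);
        eventThread.join();
        return signalled;
    }

    public static void main(String[] args) throws InterruptedException {
        LockOrderingSketch buggy = new LockOrderingSketch();
        System.out.println("signalled while holding lock: " + buggy.waitHoldingLock(200));
        System.out.println("callback fired after unlock:  " + buggy.callbackFired());

        LockOrderingSketch fixed = new LockOrderingSketch();
        System.out.println("signalled without lock:       " + fixed.waitWithoutLock(2000));
    }
}
```

In the first scenario the timed wait gives up even though the event has already arrived, which matches the "syncInternal times out, but sync succeeds later on" symptom in the issue description.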