[ https://issues.apache.org/jira/browse/YARN-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684696#comment-15684696 ]
Varun Saxena commented on YARN-5920:
------------------------------------

This test is failing due to a deadlock.

When the RM transitions to active, we store the RM delegation token master key in the state store. For this we put a state store event in the AsyncDispatcher. After the event is picked up from the AsyncDispatcher, we call RMStateStore#handleStoreEvent, where we acquire a write lock. Then, from StoreRMDTMasterKeyTransition, we call MemoryRMStateStore#storeRMDTMasterKeyState, which is a synchronized method.

In TestRMHA, we override updateApplicationState in MemoryRMStateStore, and the override is also synchronized. By overriding this method we bypass RMStateStore, i.e. when the test calls {{rm.getRMContext().getStateStore().updateApplicationState(null)}}, we do not try to acquire the write lock in RMStateStore. When updateApplicationState calls notifyStoreOperationFailed, we call either RMStateStore#isFencedState, which acquires the read lock, or RMStateStore#updateFencedState, which acquires the write lock.

Due to a race, if MemoryRMStateStore#updateApplicationState is called before MemoryRMStateStore#storeRMDTMasterKeyState is called but after RMStateStore#storeRMDTMasterKey is called, there can be a deadlock. The thread calling notifyStoreOperationFailed holds the MemoryRMStateStore monitor (it is inside the synchronized updateApplicationState) and is blocked trying to acquire the read or write lock in RMStateStore, because the write lock is held by the thread storing the RM DT master key. The thread storing the master key, in turn, is blocked entering the synchronized MemoryRMStateStore#storeRMDTMasterKeyState, because the monitor is held by the first thread, which is itself waiting on the read/write lock.

To solve this we should override updateApplicationStateInternal in MemoryRMStateStore instead, and RMStateStore#updateApplicationState should be invoked so that the normal flow of processing state store events is followed. This will get rid of the deadlock.

The deadlock can be easily simulated by putting a sleep in StoreRMDTMasterKeyTransition#transition.
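The lock ordering behind the proposed fix can be illustrated with a small standalone sketch. This is not the actual YARN code or the attached patch: SketchStateStore, MemorySketchStateStore and LockOrderSketch are hypothetical names, the event dispatcher is left out, and the assumption that the internal hook reports failure by throwing is mine. The only point it shows is that when every caller enters through the base class, each thread takes the read/write lock first and the subclass monitor second, so the two threads cannot wait on each other in a cycle.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderSketch {

  /** Hypothetical stand-in for RMStateStore: entry points take the write lock first. */
  abstract static class SketchStateStore {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final AtomicInteger failures = new AtomicInteger();

    void storeMasterKey() {
      lock.writeLock().lock();
      try {
        storeMasterKeyInternal();          // subclass monitor taken second
      } finally {
        lock.writeLock().unlock();
      }
    }

    void updateApplicationState() {
      lock.writeLock().lock();
      try {
        updateApplicationStateInternal();  // subclass monitor taken second
      } catch (Exception e) {
        // Assumption: failures surface as exceptions and are reported here,
        // while the write lock is still held and the monitor is released.
        notifyStoreOperationFailed();
      } finally {
        lock.writeLock().unlock();
      }
    }

    void notifyStoreOperationFailed() {
      // The write-lock holder may also take the read lock (the lock is reentrant).
      lock.readLock().lock();
      try {
        failures.incrementAndGet();
      } finally {
        lock.readLock().unlock();
      }
    }

    abstract void storeMasterKeyInternal();
    abstract void updateApplicationStateInternal() throws Exception;
  }

  /** Hypothetical stand-in for the MemoryRMStateStore used by the test. */
  static class MemorySketchStateStore extends SketchStateStore {
    private int keysStored;

    @Override
    synchronized void storeMasterKeyInternal() {
      keysStored++;
    }

    @Override
    synchronized void updateApplicationStateInternal() throws Exception {
      throw new Exception("simulated store failure");
    }

    synchronized int getKeysStored() {
      return keysStored;
    }
  }

  /** Runs both code paths concurrently; returns true if both threads finish. */
  static boolean runScenario() {
    MemorySketchStateStore store = new MemorySketchStateStore();
    Thread dispatcher = new Thread(store::storeMasterKey, "dispatcher");
    Thread testThread = new Thread(store::updateApplicationState, "test");
    dispatcher.setDaemon(true);
    testThread.setDaemon(true);
    dispatcher.start();
    testThread.start();
    try {
      dispatcher.join(5000);
      testThread.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !dispatcher.isAlive() && !testThread.isAlive()
        && store.getKeysStored() == 1 && store.failures.get() == 1;
  }

  public static void main(String[] args) {
    System.out.println("completed without deadlock: " + runScenario());
  }
}
```

By contrast, the failing test overrode the public method directly, so one thread took the monitor first and the read/write lock second while the dispatcher thread took them in the opposite order.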
> TestRMHA.testTransitionedToStandbyShouldNotHang is flaky
> --------------------------------------------------------
>
>                 Key: YARN-5920
>                 URL: https://issues.apache.org/jira/browse/YARN-5920
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Varun Saxena
>         Attachments: ThreadDump.txt, YARN-5920.01.patch
>
>
> In build
> [link|https://builds.apache.org/job/PreCommit-YARN-Build/13986/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
> the test case timed out. This needs to be investigated.