[ 
https://issues.apache.org/jira/browse/YARN-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684696#comment-15684696
 ] 

Varun Saxena commented on YARN-5920:
------------------------------------

This test is failing due to a deadlock.

When the RM transitions to active, we store the RM delegation token master key 
in the state store. To do this, we put a state store event in the 
AsyncDispatcher. Once the event is picked up from the AsyncDispatcher, we call 
RMStateStore#handleStoreEvent, where we acquire a write lock. Then, from 
StoreRMDTMasterKeyTransition, we call 
MemoryRMStateStore#storeRMDTMasterKeyState, which is a synchronized method.

Now in TestRMHA, we override updateApplicationState in MemoryRMStateStore, and 
the override is also synchronized. By overriding this method we bypass 
RMStateStore, i.e. when the test calls 
{{rm.getRMContext().getStateStore().updateApplicationState(null)}}, we do not 
try to acquire the write lock in RMStateStore. When updateApplicationState 
calls notifyStoreOperationFailed, we either call RMStateStore#isFencedState, 
which acquires the read lock, or RMStateStore#updateFencedState, which 
acquires the write lock.

Because of this race, if MemoryRMStateStore#updateApplicationState is called 
before MemoryRMStateStore#storeRMDTMasterKeyState but after 
RMStateStore#storeRMDTMasterKey, there can be a deadlock. 
The thread calling notifyStoreOperationFailed blocks while trying to acquire 
the read or write lock in RMStateStore, because the write lock is held by the 
thread storing the RM DT master key. Meanwhile, the thread calling 
MemoryRMStateStore#storeRMDTMasterKeyState blocks because 
MemoryRMStateStore#updateApplicationState is synchronized on the same object, 
and the thread inside it is itself blocked on the read/write lock.
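
The lock-order inversion described above can be sketched in isolation. This is 
only an illustration, not the Hadoop code: the class LockOrderSketch, its 
fields, and its method bodies are hypothetical stand-ins, the latches exist 
only to force the racy interleaving, and a timed tryLock replaces the blocking 
lock() used by the real code so that the sketch terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the state store; not the real Hadoop classes.
class LockOrderSketch {
    // Plays the role of the read/write lock in RMStateStore.
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    // Latches used only to force the problematic interleaving.
    private final CountDownLatch writeLockHeld = new CountDownLatch(1);
    private final CountDownLatch monitorHeld = new CountDownLatch(1);

    // Plays the role of MemoryRMStateStore#storeRMDTMasterKeyState.
    synchronized void storeMasterKeyState() {
        // Nothing to do; entering the monitor is the point.
    }

    // Plays the role of the test's overridden updateApplicationState:
    // takes the monitor first, then needs the read lock (as isFencedState does).
    synchronized boolean updateApplicationState() throws InterruptedException {
        monitorHeld.countDown();
        // The real code blocks here forever; the timeout keeps the sketch finite.
        boolean acquired = stateLock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
        if (acquired) {
            stateLock.readLock().unlock();
        }
        return acquired;
    }

    // Plays the role of RMStateStore#handleStoreEvent:
    // takes the write lock first, then needs the monitor.
    void handleStoreEvent() throws InterruptedException {
        stateLock.writeLock().lock();
        try {
            writeLockHeld.countDown();
            monitorHeld.await();   // wait until the other thread owns the monitor
            storeMasterKeyState(); // blocks until that thread leaves the monitor
        } finally {
            stateLock.writeLock().unlock();
        }
    }

    // Returns whether the "test" thread managed to take the read lock.
    static boolean readLockAcquired() throws InterruptedException {
        LockOrderSketch store = new LockOrderSketch();
        Thread dispatcher = new Thread(() -> {
            try {
                store.handleStoreEvent();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dispatcher.start();
        store.writeLockHeld.await();
        boolean acquired = store.updateApplicationState();
        dispatcher.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("read lock acquired: " + readLockAcquired());
    }
}
```

In this sketch the read lock is never acquired: the dispatcher thread cannot 
release the write lock until it enters the monitor, and the monitor is held 
for the whole duration of the tryLock. With a blocking lock() in place of the 
timed tryLock, both threads would wait on each other indefinitely.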

To solve this, the test should override updateApplicationStateInternal in 
MemoryRMStateStore and invoke RMStateStore#updateApplicationState, so that the 
normal flow of processing state store events is followed. This will get rid of 
the deadlock.
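
Why routing both operations through the same event-handling path helps can 
also be sketched under simplifying assumptions. The class 
ConsistentOrderSketch and its methods are hypothetical stand-ins, not the 
actual patch: every operation takes the write lock first and only then enters 
a synchronized internal method, so the two locks are always acquired in the 
same order.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the state store; not the real Hadoop classes.
class ConsistentOrderSketch {
    // Plays the role of the read/write lock in RMStateStore.
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    private int handled = 0;

    // Plays the role of MemoryRMStateStore#storeRMDTMasterKeyState.
    private synchronized void storeMasterKeyStateInternal() {
        handled++;
    }

    // Plays the role of an overridden updateApplicationStateInternal that
    // reports a failure and therefore checks the fenced state.
    private synchronized void updateApplicationStateInternal() {
        // The caller already holds the write lock, and ReentrantReadWriteLock
        // lets the write-lock holder also take the read lock.
        stateLock.readLock().lock();
        try {
            handled++;
        } finally {
            stateLock.readLock().unlock();
        }
    }

    // Plays the role of RMStateStore#handleStoreEvent: write lock first,
    // then the synchronized internal operation.
    void handleEvent(Runnable internalOp) {
        stateLock.writeLock().lock();
        try {
            internalOp.run();
        } finally {
            stateLock.writeLock().unlock();
        }
    }

    // Runs both operations concurrently and returns how many completed.
    static int run() throws InterruptedException {
        ConsistentOrderSketch store = new ConsistentOrderSketch();
        Thread a = new Thread(() -> store.handleEvent(store::storeMasterKeyStateInternal));
        Thread b = new Thread(() -> store.handleEvent(store::updateApplicationStateInternal));
        a.start();
        b.start();
        a.join();
        b.join();
        return store.handled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("operations completed: " + run());
    }
}
```

Because neither thread ever holds the monitor while waiting for the write 
lock, the circular wait cannot form and both operations complete regardless 
of scheduling.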

This deadlock can be easily reproduced by putting a sleep in 
StoreRMDTMasterKeyTransition#transition.


> TestRMHA.testTransitionedToStandbyShouldNotHang is flaky
> --------------------------------------------------------
>
>                 Key: YARN-5920
>                 URL: https://issues.apache.org/jira/browse/YARN-5920
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Varun Saxena
>         Attachments: ThreadDump.txt, YARN-5920.01.patch
>
>
> In build 
> [link|https://builds.apache.org/job/PreCommit-YARN-Build/13986/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
>  the test case timed out. This needs to be investigated.  



