[
https://issues.apache.org/jira/browse/YARN-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684696#comment-15684696
]
Varun Saxena commented on YARN-5920:
------------------------------------
This test is failing due to a deadlock.
When the RM transitions to active, we store the RM delegation token master key
in the state store. For this, we put a state store event in the AsyncDispatcher.
After the event is picked up from the AsyncDispatcher, we call
RMStateStore#handleStoreEvent, where we acquire a write lock. Then, from
StoreRMDTMasterKeyTransition, we call
MemoryRMStateStore#storeRMDTMasterKeyState, which is a synchronized method.
Now in TestRMHA, we override updateApplicationState in MemoryRMStateStore, which
is also synchronized. By overriding this method, we bypass RMStateStore,
i.e. when the test calls
{{rm.getRMContext().getStateStore().updateApplicationState(null)}}, we do not
try to acquire the write lock in RMStateStore. When updateApplicationState calls
notifyStoreOperationFailed, we either call RMStateStore#isFencedState, which
acquires the read lock, or call RMStateStore#updateFencedState, which
acquires the write lock.
Now, due to a race, if MemoryRMStateStore#updateApplicationState is called before
MemoryRMStateStore#storeRMDTMasterKeyState is called but after
RMStateStore#storeRMDTMasterKey is called, there can be a deadlock.
This is because the thread calling notifyStoreOperationFailed is blocked
trying to acquire the read or write lock in RMStateStore, as the write lock
is held by the thread storing the RM DT master key. Meanwhile, the thread
calling MemoryRMStateStore#storeRMDTMasterKeyState is blocked because that
method is synchronized on the same MemoryRMStateStore instance as
updateApplicationState, and the thread inside updateApplicationState is itself
blocked on the read/write lock.
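
The lock-order inversion described above can be sketched in isolation. This is
a minimal illustration, not the actual Hadoop code: the class and field names
are hypothetical, and a ReentrantLock stands in for the MemoryRMStateStore
monitor (the synchronized methods) only so that the demo can time out instead
of hanging. Latches force the unlucky interleaving that the race produces.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in classes; none of these names come from Hadoop.
class LockOrderInversionSketch {
    // Stand-in for the read/write lock inside RMStateStore.
    static final ReentrantReadWriteLock storeLock = new ReentrantReadWriteLock();
    // Stand-in for the MemoryRMStateStore monitor (assumption: a ReentrantLock
    // replaces "synchronized" so that acquisition can time out).
    static final ReentrantLock monitor = new ReentrantLock();

    /** Returns true if each thread timed out waiting for the other's lock. */
    static boolean bothThreadsBlock() {
        CountDownLatch firstLocksHeld = new CountDownLatch(2);
        CountDownLatch attemptsDone = new CountDownLatch(2);
        boolean[] blocked = new boolean[2];

        // Dispatcher path: write lock first, then the store monitor.
        Thread dispatcher = new Thread(() -> {
            storeLock.writeLock().lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();          // wait until both hold their first lock
                if (monitor.tryLock(200, TimeUnit.MILLISECONDS)) {
                    monitor.unlock();
                } else {
                    blocked[0] = true;           // would hang forever with "synchronized"
                }
                attemptsDone.countDown();
                attemptsDone.await();            // keep the write lock until both attempts end
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                storeLock.writeLock().unlock();
            }
        });

        // Test path: store monitor first, then the read lock (as in isFencedState).
        Thread testThread = new Thread(() -> {
            monitor.lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                if (storeLock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    storeLock.readLock().unlock();
                } else {
                    blocked[1] = true;
                }
                attemptsDone.countDown();
                attemptsDone.await();            // keep the monitor until both attempts end
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                monitor.unlock();
            }
        });

        dispatcher.start();
        testThread.start();
        try {
            dispatcher.join();
            testThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return blocked[0] && blocked[1];
    }
}
```

With real synchronized methods and untimed lock acquisition, neither attempt
would ever give up, which matches the hang seen in the test.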
To solve this, we should override updateApplicationStateInternal in
MemoryRMStateStore instead, so that RMStateStore#updateApplicationState is
invoked and the normal flow of processing state store events is followed. This
will get rid of the deadlock.
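
The idea behind the fix is that every caller takes the locks in the same order:
the RMStateStore lock first, then the store's monitor. The sketch below is a
heavily simplified illustration under assumptions: BaseStore and MemoryStore
are hypothetical stand-ins, the dispatcher and state machine are omitted, and
the method signatures are not those of the real classes.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-ins; not the real Hadoop classes or signatures.
class ConsistentLockOrderSketch {
    static class BaseStore {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private boolean fenced = false;

        // Public entry points always take the write lock before calling *Internal.
        public final void updateApplicationState(String app) {
            lock.writeLock().lock();
            try {
                try {
                    updateApplicationStateInternal(app);
                } catch (Exception e) {
                    notifyStoreOperationFailed();
                }
            } finally {
                lock.writeLock().unlock();
            }
        }

        public final void storeMasterKey(String key) {
            lock.writeLock().lock();
            try {
                storeMasterKeyInternal(key);
            } finally {
                lock.writeLock().unlock();
            }
        }

        protected void updateApplicationStateInternal(String app) throws Exception { }
        protected void storeMasterKeyInternal(String key) { }

        void notifyStoreOperationFailed() {
            lock.writeLock().lock();   // re-entrant: this thread already holds it
            try { fenced = true; } finally { lock.writeLock().unlock(); }
        }

        boolean isFenced() {
            lock.readLock().lock();
            try { return fenced; } finally { lock.readLock().unlock(); }
        }
    }

    static class MemoryStore extends BaseStore {
        private int keysStored;
        @Override protected synchronized void storeMasterKeyInternal(String key) { keysStored++; }
        synchronized int keysStored() { return keysStored; }
    }

    /** Runs both paths concurrently; returns true if both finish and all state is as expected. */
    static boolean completesWithoutDeadlock() {
        // The test overrides only the internal hook, so the write lock is still taken first.
        MemoryStore store = new MemoryStore() {
            @Override protected synchronized void updateApplicationStateInternal(String app)
                    throws Exception {
                throw new Exception("simulated store failure");
            }
        };
        Thread dispatcher = new Thread(() -> {
            for (int i = 0; i < 1000; i++) store.storeMasterKey("key" + i);
        });
        Thread testThread = new Thread(() -> {
            for (int i = 0; i < 1000; i++) store.updateApplicationState(null);
        });
        dispatcher.setDaemon(true);
        testThread.setDaemon(true);
        dispatcher.start();
        testThread.start();
        try {
            dispatcher.join(5000);
            testThread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !dispatcher.isAlive() && !testThread.isAlive()
                && store.isFenced() && store.keysStored() == 1000;
    }
}
```

Because the failure path now runs while the write lock is already held, the
fencing call re-enters that lock instead of waiting on another thread.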
This deadlock can be easily reproduced by putting a sleep in
StoreRMDTMasterKeyTransition#transition.
> TestRMHA.testTransitionedToStandbyShouldNotHang is flaky
> --------------------------------------------------------
>
> Key: YARN-5920
> URL: https://issues.apache.org/jira/browse/YARN-5920
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Rohith Sharma K S
> Assignee: Varun Saxena
> Attachments: ThreadDump.txt, YARN-5920.01.patch
>
>
> In build
> [link|https://builds.apache.org/job/PreCommit-YARN-Build/13986/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
> the test case timed out. This needs to be investigated.