[
https://issues.apache.org/jira/browse/YARN-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232204#comment-14232204
]
Tsuyoshi OZAWA commented on YARN-2874:
--------------------------------------
[~Naganarasimha] Thanks for reporting this. I dug into the code. I think this
deadlock can be caused by the following code paths:
1. delayedRemovalThread.start > removeApplicationFromRenewal() > *synchronized
(delegationTokens) {}* > *dttr.timerTask.cancel()* >
DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent() >
handleAppSubmitEvent > addTokenToList() >
*delegationTokens(Collections$SynchronizedSet)*
2. renewalTimer.schedule() > RenewalTimerTask#run >
removeFailedDelegationToken > *tr.timerTask.cancel()* >
DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent >
handleAppSubmitEvent > addTokenToList >
*delegationTokens(Collections$SynchronizedSet)*
The current code path is as follows:
1. delayedRemovalThread.start > removeApplicationFromRenewal() >
*synchronized (tokenSet) {}* > *dttr.timerTask.cancel()* >
DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent() >
handleAppSubmitEvent > *appTokens.get(applicationId).add(dtr)* # the set returned
by appTokens.get looks to be the same as tokenSet
2. renewalTimer.schedule() > RenewalTimerTask#run >
removeFailedDelegationToken > *tr.timerTask.cancel()* >
DelegationTokenRenewerRunnable#handleDTRenewerAppSubmitEvent >
handleAppSubmitEvent > *appTokens.get(applicationId).add(dtr)*
The cause of this issue is the inconsistent lock order between tokenSet and timerTask.
I think the fix by Naganarasimha works well in this case. [~jlowe], [~kasha],
please let me know if I'm wrong.
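
To illustrate the lock-order point, here is a minimal, self-contained sketch. It
is not the attached patch and not the real DelegationTokenRenewer code: the names
LockOrderSketch, SketchTimerTask and removeAll are hypothetical stand-ins. It
assumes a timer task whose run() and cancel() are both synchronized on the task
and whose run() touches a synchronized set. The removal path copies the set
while holding the set's monitor and calls cancel() only after releasing it, so
no thread ever waits for the task's monitor while holding the set's monitor.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TimerTask;

class LockOrderSketch {

  // Hypothetical stand-in for RenewalTimerTask: run() and cancel() share the
  // task's own monitor, as in the jstack dump quoted below.
  static class SketchTimerTask extends TimerTask {
    private final Set<SketchTimerTask> owner;
    private boolean cancelled = false;

    SketchTimerTask(Set<SketchTimerTask> owner) {
      this.owner = owner;
    }

    @Override
    public synchronized void run() {
      if (cancelled) {
        return;
      }
      // Takes the set's monitor while holding the task's monitor
      // (order: task, then set).
      owner.remove(this);
    }

    @Override
    public synchronized boolean cancel() {
      cancelled = true;
      return super.cancel();
    }
  }

  // Removal path that never calls cancel() while holding the set's monitor.
  // Returns the number of tasks it cancelled.
  static int removeAll(Set<SketchTimerTask> tokenSet) {
    List<SketchTimerTask> snapshot;
    synchronized (tokenSet) {
      // Copy and clear under the set's monitor only.
      snapshot = new ArrayList<>(tokenSet);
      tokenSet.clear();
    }
    int cancelledCount = 0;
    for (SketchTimerTask task : snapshot) {
      // Only the task's monitor is taken here; the set's monitor is released.
      task.cancel();
      cancelledCount++;
    }
    return cancelledCount;
  }

  public static void main(String[] args) {
    Set<SketchTimerTask> tokenSet =
        Collections.synchronizedSet(new HashSet<SketchTimerTask>());
    tokenSet.add(new SketchTimerTask(tokenSet));
    tokenSet.add(new SketchTimerTask(tokenSet));
    int n = removeAll(tokenSet);
    System.out.println("cancelled " + n + " task(s), set empty: " + tokenSet.isEmpty());
  }
}
```

Another option under the same assumptions would be to drop the synchronized
modifier from run() and cancel() and track cancellation with an AtomicBoolean,
so that the task's monitor is never part of the cycle at all.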
> Dead lock in "DelegationTokenRenewer" which blocks RM to execute any further
> apps
> ---------------------------------------------------------------------------------
>
> Key: YARN-2874
> URL: https://issues.apache.org/jira/browse/YARN-2874
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.5.1
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Priority: Blocker
> Attachments: YARN-2874.20141118-1.patch, YARN-2874.20141118-2.patch
>
>
> When token renewal fails and the application finishes, this deadlock can occur.
> jstack dump:
> {quote}
> Found one Java-level deadlock:
> =============================
> "DelegationTokenRenewer #181865":
> waiting to lock monitor 0x0000000000900918 (object 0x00000000c18a9998, a
> java.util.Collections$SynchronizedSet),
> which is held by "DelayedTokenCanceller"
> "DelayedTokenCanceller":
> waiting to lock monitor 0x0000000004141718 (object 0x00000000c7eae720, a
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask),
> which is held by "Timer-4"
> "Timer-4":
> waiting to lock monitor 0x0000000000900918 (object 0x00000000c18a9998, a
> java.util.Collections$SynchronizedSet),
> which is held by "DelayedTokenCanceller"
>
> Java stack information for the threads listed above:
> ===================================================
> "DelegationTokenRenewer #181865":
> at java.util.Collections$SynchronizedCollection.add(Collections.java:1636)
> - waiting to lock <0x00000000c18a9998> (a
> java.util.Collections$SynchronizedSet)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.addTokenToList(DelegationTokenRenewer.java:322)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:398)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$500(DelegationTokenRenewer.java:70)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:657)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:638)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> "DelayedTokenCanceller":
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask.cancel(DelegationTokenRenewer.java:443)
> - waiting to lock <0x00000000c7eae720> (a
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.removeApplicationFromRenewal(DelegationTokenRenewer.java:558)
> - locked <0x00000000c18a9998> (a java.util.Collections$SynchronizedSet)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$300(DelegationTokenRenewer.java:70)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelayedTokenRemovalRunnable.run(DelegationTokenRenewer.java:599)
> at java.lang.Thread.run(Thread.java:745)
> "Timer-4":
> at java.util.Collections$SynchronizedCollection.remove(Collections.java:1639)
> - waiting to lock <0x00000000c18a9998> (a
> java.util.Collections$SynchronizedSet)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.removeFailedDelegationToken(DelegationTokenRenewer.java:503)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$100(DelegationTokenRenewer.java:70)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask.run(DelegationTokenRenewer.java:437)
> - locked <0x00000000c7eae720> (a
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$RenewalTimerTask)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
>
> Found 1 deadlock.
> {quote}