[
https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695578#comment-15695578
]
zhengchenyu edited comment on YARN-4090 at 11/26/16 6:30 AM:
-------------------------------------------------------------
Here we see a deadlock:
"IPC Server handler 98 on 8032" is waiting for lock (0x00007f42e17a5ed8)
"IPC Server handler 76 on 8032" holds the lock (0x00007f42e17a5ed8) and is
waiting for lock (0x00007f42df3e8450)
"ResourceManager Event Processor" holds the lock (0x00007f42df3e8450) and is
waiting for lock (0x00007f42e17a5ed8)
0x00007f42e17a5ed8 is the object lock of an FSParentQueue, which I call
root.Parent here.
0x00007f42df3e8450 is the object lock of another FSParentQueue, a child
queue of 0x00007f42e17a5ed8, which I call root.Parent.Child.
Let's trace these threads.
(1) ResourceManager Event Processor
{code}
FairScheduler.handle
FairScheduler.nodeUpdate
FairScheduler.completedContainer
FSAppAttempt.containerCompleted
FSLeafQueue.decResourceUsage
//got the lock 0x00007f42e0c7cf50
FSParentQueue.decResourceUsage
//got the lock 0x00007f42df3e8450, which is the object lock of root.Parent.Child
FSParentQueue.decResourceUsage
//wait for 0x00007f42e17a5ed8, which is the object lock of root.Parent
{code}
(2) IPC Server handler 76 on 8032
{code}
ClientRMService.getQueueUserAcls
FairScheduler.getQueueUserAclInfo
FSParentQueue.getQueueUserAclInfo
//got the lock 0x00007f42e17a5ed8
FSParentQueue.getQueueUserAclInfo
//wait for the lock 0x00007f42df3e8450
{code}
The remaining thread, IPC Server handler 98, is only waiting and does not need
to be analysed. Here we can see that decResourceUsage acquires the object locks
from bottom to top, while getQueueUserAcls acquires them from top to bottom.
getQueueUserAcls holds the object locks of root and root.Parent and waits for
root.Parent.Child, while decResourceUsage holds the object lock of
root.Parent.Child and waits for root.Parent. That is a deadlock.
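The lock-order inversion described above can be reproduced in a few lines. This is only a sketch, not the FairScheduler code: the class name, the two lock fields and the 200 ms timeout are assumptions made for illustration, and ReentrantLock.tryLock with a timeout stands in for the synchronized methods so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderInversionDemo {
    // Hypothetical stand-ins for the monitors of root.Parent and root.Parent.Child.
    static final ReentrantLock parentLock = new ReentrantLock();
    static final ReentrantLock childLock = new ReentrantLock();

    /** Takes 'first', waits until the other thread holds its own first lock, then tries 'second'. */
    static Thread worker(ReentrantLock first, ReentrantLock second,
                         CyclicBarrier barrier, AtomicInteger blocked) {
        return new Thread(() -> {
            first.lock();
            try {
                barrier.await();                // both threads now hold their first lock
                if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                    second.unlock();
                } else {
                    blocked.incrementAndGet();  // a synchronized method would wait here forever
                }
                barrier.await();                // keep holding 'first' until both attempts are done
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                first.unlock();
            }
        });
    }

    /** Runs one bottom-up and one top-down thread; returns how many failed to get their second lock. */
    static int countBlockedThreads() {
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger blocked = new AtomicInteger();
        Thread bottomUp = worker(childLock, parentLock, barrier, blocked); // like decResourceUsage
        Thread topDown = worker(parentLock, childLock, barrier, blocked);  // like getQueueUserAclInfo
        bottomUp.start();
        topDown.start();
        try {
            bottomUp.join();
            topDown.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return blocked.get();
    }

    public static void main(String[] args) {
        System.out.println("threads unable to take their second lock: " + countBlockedThreads());
    }
}
```

Each thread holds its first lock while asking for the lock the other thread already owns, so neither request can succeed; with plain object locks there is no timeout and both threads stay blocked.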
I recommend rewriting decResourceUsage so that it acquires the object locks
from top to bottom. Another option is to replace the object locks with a
ReadWriteLock.
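A minimal sketch of the top-to-bottom variant follows, assuming a simplified queue node with a parent pointer. QueueNode, usedMemory and lockAndApply are hypothetical names, not the real FSQueue API, and this is not the patch attached to the issue. The idea is to collect the path from the leaf to the root first and only then take the monitors root-first, which matches the order used by getQueueUserAclInfo.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TopDownLockingSketch {
    /** Hypothetical, simplified queue node; not the real FSQueue hierarchy. */
    static final class QueueNode {
        final String name;
        final QueueNode parent;   // null for the root queue
        long usedMemory;

        QueueNode(String name, QueueNode parent, long usedMemory) {
            this.name = name;
            this.parent = parent;
            this.usedMemory = usedMemory;
        }
    }

    /**
     * Decrements usage on the leaf and all of its ancestors, taking the
     * monitors root-first. Returns the acquisition order for illustration.
     */
    static List<String> decResourceUsage(QueueNode leaf, long amount) {
        Deque<QueueNode> path = new ArrayDeque<>();
        for (QueueNode q = leaf; q != null; q = q.parent) {
            path.addFirst(q);     // the root ends up at the head of the deque
        }
        List<String> order = new ArrayList<>();
        lockAndApply(path, amount, order);
        return order;
    }

    private static void lockAndApply(Deque<QueueNode> path, long amount, List<String> order) {
        QueueNode q = path.pollFirst();
        if (q == null) {
            return;               // every queue on the path has been updated
        }
        synchronized (q) {
            order.add(q.name);
            q.usedMemory -= amount;
            // Descend while still holding the ancestor's monitor.
            lockAndApply(path, amount, order);
        }
    }

    public static void main(String[] args) {
        QueueNode root = new QueueNode("root", null, 100);
        QueueNode parent = new QueueNode("root.Parent", root, 100);
        QueueNode child = new QueueNode("root.Parent.Child", parent, 100);
        QueueNode leaf = new QueueNode("root.Parent.Child.Leaf", child, 100);
        System.out.println(decResourceUsage(leaf, 10));
    }
}
```

The trade-off in this sketch is that the root monitor is held for the whole update, so it serialises more work than the bottom-up version; whether that is acceptable for the scheduler would need to be measured.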
> Make Collections.sort() more efficient in FSParentQueue.java
> ------------------------------------------------------------
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: fairscheduler
> Reporter: Xianyin Xin
> Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch,
> YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg,
> sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.