[ https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695578#comment-15695578 ]
zhengchenyu edited comment on YARN-4090 at 11/25/16 11:25 AM:
--------------------------------------------------------------

Here we see a deadlock:

"IPC Server handler 98 on 8032" is waiting for lock (0x00007f42e17a5ed8)
"IPC Server handler 76 on 8032" got the lock (0x00007f42e17a5ed8) and is waiting for lock (0x00007f42df3e8450)
"ResourceManager Event Processor" got the lock (0x00007f42df3e8450) and is waiting for lock (0x00007f42e17a5ed8)

In fact, 0x00007f42e17a5ed8 is the object lock of an FSParentQueue, which I call root.Parent here. 0x00007f42df3e8450 is the object lock of another FSParentQueue, the child queue of 0x00007f42e17a5ed8, which I call root.Parent.Child.

Let's trace these threads.

(1) ResourceManager Event Processor
{code}
FairScheduler.handle
FairScheduler.nodeUpdate
FairScheduler.completedContainer
FSAppAttempt.containerCompleted
FSLeafQueue.decResourceUsage      //got the lock 0x00007f42e0c7cf50
FSParentQueue.decResourceUsage    //got the lock 0x00007f42df3e8450, which is the object lock of root.Parent.Child
FSParentQueue.decResourceUsage    //waits for 0x00007f42e17a5ed8, which is the object lock of root.Parent
{code}

(2) IPC Server handler 76 on 8032
{code}
ClientRMService.getQueueUserAcls
FairScheduler.getQueueUserAclInfo
FSParentQueue.getQueueUserAclInfo    //got the lock 0x00007f42e17a5ed8
FSParentQueue.getQueueUserAclInfo    //waits for the lock 0x00007f42df3e8450
{code}

The remaining thread (IPC Server handler 98) does not need to be analysed: it is only waiting for 0x00007f42e17a5ed8, which handler 76 already holds.

Here we can see that decResourceUsage acquires the object locks from bottom to top, but getQueueUserAcls acquires them from top to bottom. getQueueUserAcls holds the object locks of root and root.Parent and waits for root.Parent.Child, while decResourceUsage holds the object lock of root.Parent.Child and waits for root.Parent. That's a deadlock.

I recommend rewriting decResourceUsage so that it acquires the object locks from top to bottom.
Another way is to replace the object lock with a ReadWriteLock.
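
The top-to-bottom ordering recommended in the comment can be sketched roughly as follows. This is only an illustration under stated assumptions: QueueLockOrderSketch, its nested Queue class, and countQueuesTopDown are hypothetical stand-ins invented for this example, not the real FSParentQueue or FairScheduler code, and the real patch may look quite different. The idea is that the updater first walks up to the root without taking any lock, then takes the object locks from the root downward, which is the same order the top-down reader uses.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class QueueLockOrderSketch {

    /** Hypothetical stand-in for a scheduler queue; NOT the real Hadoop FSQueue. */
    public static final class Queue {
        final String name;
        final Queue parent;
        final List<Queue> children = new ArrayList<>();
        private int usage;

        // The tree is assumed to be fully built before any worker thread starts.
        public Queue(String name, Queue parent, int usage) {
            this.name = name;
            this.parent = parent;
            this.usage = usage;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        public synchronized int getUsage() {
            return usage;
        }
    }

    /** Decrements usage on the queue and all its ancestors, locking root first. */
    public static void decResourceUsage(Queue queue, int amount) {
        Deque<Queue> path = new ArrayDeque<>();
        // Collect the path without holding any lock; the root ends up first.
        for (Queue q = queue; q != null; q = q.parent) {
            path.addFirst(q);
        }
        lockDownThenApply(path, amount);
    }

    // Takes each queue's object lock in root-to-leaf order, then updates it.
    private static void lockDownThenApply(Deque<Queue> path, int amount) {
        Queue q = path.pollFirst();
        if (q == null) {
            return;
        }
        synchronized (q) {
            lockDownThenApply(path, amount);
            q.usage -= amount;
        }
    }

    /** Top-down traversal, locking in the same root-to-leaf order as above. */
    public static int countQueuesTopDown(Queue q) {
        synchronized (q) {
            int count = 1;
            for (Queue child : q.children) {
                count += countQueuesTopDown(child);
            }
            return count;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Queue root = new Queue("root", null, 1000);
        Queue parent = new Queue("root.Parent", root, 1000);
        Queue child = new Queue("root.Parent.Child", parent, 1000);

        Thread updater = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                decResourceUsage(child, 1);
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                countQueuesTopDown(root);
            }
        });
        updater.start();
        reader.start();
        updater.join();
        reader.join();
        System.out.println("root usage after updates: " + root.getUsage());
    }
}
```

Because both code paths take the locks in root-to-leaf order, no thread can hold a child lock while waiting for its parent, so the cycle described above cannot form. The cost in this sketch is that the root lock is held for the whole update, which may increase contention on a busy scheduler.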
> Make Collections.sort() more efficient in FSParentQueue.java
> ------------------------------------------------------------
>
> Key: YARN-4090
> URL: https://issues.apache.org/jira/browse/YARN-4090
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: fairscheduler
> Reporter: Xianyin Xin
> Assignee: Xianyin Xin
> Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, sampling2.jpg
>
>
> Collections.sort() consumes too much time in a scheduling round.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org