[ https://issues.apache.org/jira/browse/YARN-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15697409#comment-15697409 ]
zhangyubiao commented on YARN-5188:
-----------------------------------
JNI global references: 280
Found one Java-level deadlock:
=============================
"IPC Server handler 99 on 8032":
waiting to lock monitor 0x00007f9c6c3e1f58 (object 0x00007f9113c08d80, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue),
which is held by "IPC Server handler 27 on 8032"
"IPC Server handler 27 on 8032":
waiting to lock monitor 0x0000000001b42518 (object 0x00007f9113c0aa08, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue),
which is held by "ResourceManager Event Processor"
"ResourceManager Event Processor":
waiting to lock monitor 0x00007f9c6c3e1f58 (object 0x00007f9113c08d80, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue),
which is held by "IPC Server handler 27 on 8032"
Java stack information for the threads listed above:
===================================================
"IPC Server handler 99 on 8032":
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:160)
- waiting to lock <0x00007f9113c08d80> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1518)
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:903)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:280)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)
"IPC Server handler 27 on 8032":
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:160)
- waiting to lock <0x00007f9113c0aa08> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:167)
- locked <0x00007f9113c08d80> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1518)
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:903)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:280)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)
"ResourceManager Event Processor":
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.decResourceUsage(FSQueue.java:82)
- waiting to lock <0x00007f9113c08d80> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.decResourceUsage(FSQueue.java:84)
- locked <0x00007f9113c0aa08> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.decResourceUsage(FSQueue.java:84)
- locked <0x00007f9113c0b060> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.containerCompleted(FSAppAttempt.java:154)
- locked <0x00007f9113c0b470> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:876)
- locked <0x00007f9112d22230> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1034)
- locked <0x00007f9112d22230> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1245)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
at java.lang.Thread.run(Thread.java:745)
Found 1 deadlock.
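
Reading the dump: "IPC Server handler 27" holds a parent queue's monitor in getQueueUserAclInfo and waits for a child queue's monitor, while "ResourceManager Event Processor" holds that child's monitor in decResourceUsage and waits for the parent's. That looks like a lock-order inversion between a top-down walk and a bottom-up update. Below is a minimal sketch of one possible way to avoid it, assuming the usage counter can be kept in an atomic so the bottom-up path takes no monitors at all. SketchQueue, LockOrderSketch, collectNames, changeUsage and usedMb are hypothetical names for illustration; none of this is the real FSQueue code or the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a scheduler queue; not the real FSQueue. */
class SketchQueue {
    final String name;
    final SketchQueue parent;                        // null for the root
    final List<SketchQueue> children = new ArrayList<>();
    private final AtomicLong usedMb = new AtomicLong();

    /** The tree is assumed to be fully built before any worker thread starts. */
    SketchQueue(String name, SketchQueue parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    /** Top-down walk: locks this queue, then each child (parent before child). */
    synchronized List<String> collectNames() {
        List<String> out = new ArrayList<>();
        out.add(name);
        for (SketchQueue child : children) {
            out.addAll(child.collectNames());
        }
        return out;
    }

    /** Bottom-up update that holds no monitor, so it cannot invert the lock order. */
    void changeUsage(long deltaMb) {
        for (SketchQueue q = this; q != null; q = q.parent) {
            q.usedMb.addAndGet(deltaMb);
        }
    }

    long usedMb() {
        return usedMb.get();
    }
}

class LockOrderSketch {
    /** Runs the top-down reader and the bottom-up writer concurrently and waits for both. */
    static void runConcurrently(SketchQueue root, SketchQueue leaf, int rounds) {
        Thread reader = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                root.collectNames();
            }
        });
        Thread writer = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                leaf.changeUsage(1);
                leaf.changeUsage(-1);
            }
        });
        reader.start();
        writer.start();
        try {
            reader.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }
}
```

The trade-off in this sketch is that a reader may briefly see a child's usage updated before its parent's. If that is not acceptable, the alternative is to keep the monitors but acquire them in one fixed order (for example always parent before child) on every code path.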
> FairScheduler performance bug
> -----------------------------
>
> Key: YARN-5188
> URL: https://issues.apache.org/jira/browse/YARN-5188
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.5.0
> Reporter: ChenFolin
> Attachments: YARN-5188-1.patch
>
>
> My Hadoop cluster has recently encountered a performance problem. Details
> follow.
> There are two points that can cause this performance issue:
> 1: Applications are sorted before each container assignment in FSLeafQueue.
> TreeSet is not the best choice. Why not keep the list ordered, and use binary
> search to restore the order when an application's resource usage has changed?
> 2: The queue sort and assignContainerPreCheck recompute the resource usage of
> every leaf queue. Why can't we keep each leaf queue's usage in memory and
> update it when a container is assigned or released?
>
> The efficiency of container assignment in the ResourceManager may fall
> when the number of running and pending applications grows. In fact, the
> cluster has too many PendingMB or PendingVcore, and the cluster's
> current utilization rate may be below 20%.
> I checked the ResourceManager logs and found that each container
> assignment may cost 5 ~ 10 ms, but just 0 ~ 1 ms in normal times.
>
> I used TestFairScheduler to reproduce the scenario:
>
> Just one queue: root.default
> 10240 apps.
>
> assign container avg time: 6753.9 us ( 6.7539 ms)
> apps sort time (FSLeafQueue : Collections.sort(runnableApps,
> comparator); ): 4657.01 us ( 4.657 ms )
> compute LeafQueue Resource usage : 905.171 us ( 0.905171 ms )
>
> With just root.default, one assign container op consists of: ( one apps
> sort op ) + 2 * ( compute leafqueue usage op )
> Given the measurements above, I think the assign container op has
> a performance problem.
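
For reference, the two proposals in the description (keep the app list ordered with binary search, and cache the queue usage instead of recomputing it) can be sketched together as below. This is only an illustration under stated assumptions: SketchApp, OrderedApps, BY_USAGE, updateUsage and totalUsedMb are hypothetical names, the ordering is by memory usage then id rather than the real fair-share comparator, and app ids are assumed to be unique. It is not the content of YARN-5188-1.patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical app record; the real scheduler compares apps by its scheduling policy. */
class SketchApp {
    final String id;
    long usedMb;

    SketchApp(String id, long usedMb) {
        this.id = id;
        this.usedMb = usedMb;
    }
}

/** Keeps apps sorted at all times and caches the queue's total usage. */
class OrderedApps {
    // Usage first, then id, so the order is total and binarySearch is well defined.
    static final Comparator<SketchApp> BY_USAGE =
        Comparator.comparingLong((SketchApp a) -> a.usedMb).thenComparing(a -> a.id);

    private final List<SketchApp> apps = new ArrayList<>();
    private long totalUsedMb;                      // cached, never recomputed by a full scan

    void add(SketchApp app) {
        insertSorted(app);
        totalUsedMb += app.usedMb;
    }

    /** Moves one app to its new position instead of re-sorting the whole list. */
    void updateUsage(SketchApp app, long newUsedMb) {
        // Locate the app while it still carries its old sort key.
        int oldPos = Collections.binarySearch(apps, app, BY_USAGE);
        if (oldPos < 0) {
            throw new IllegalArgumentException("unknown app " + app.id);
        }
        apps.remove(oldPos);
        totalUsedMb += newUsedMb - app.usedMb;     // incremental update of the cached usage
        app.usedMb = newUsedMb;
        insertSorted(app);
    }

    private void insertSorted(SketchApp app) {
        int pos = Collections.binarySearch(apps, app, BY_USAGE);
        apps.add(pos < 0 ? -pos - 1 : pos, app);   // negative result encodes the insertion point
    }

    long totalUsedMb() {
        return totalUsedMb;
    }

    List<String> ids() {
        List<String> out = new ArrayList<>();
        for (SketchApp a : apps) {
            out.add(a.id);
        }
        return out;
    }
}
```

Each update needs O(log n) comparisons plus an O(n) element shift in the ArrayList, rather than a full O(n log n) sort on every assignment. Whether that removes the measured sort cost in FSLeafQueue would have to be confirmed with the same TestFairScheduler setup.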