Adam Kawa created YARN-1422: ------------------------------- Summary: RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing Key: YARN-1422 URL: https://issues.apache.org/jira/browse/YARN-1422 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.2.0 Reporter: Adam Kawa
If getQueueUserAclInfo() on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) is called, and a container is completing, then the ResourceManager can deadlock. It is similar to https://issues.apache.org/jira/browse/YARN-325. *More details:* * Thread A 1) In a synchronized block of code (a lockid 0x00000000c18d8870=LeafQueue.class), LeafQueue.completedContainer wants to inform the parent queue that a container is being completed and invokes ParentQueue.completedContainer method. 3) The ParentQueue.completedContainer waits to aquire a lock on itself (a lockid 0x00000000c1846350=ParentQueue.class) to go to synchronized block of code. It can not accuire this lock, because Thread B already has this lock. * Thread B 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This method invokes a synchronized method on ParentQueue.class i.e. ParentQueue.getQueueUserAclInfo (a lockid 0x00000000c1846350=ParentQueue.class) and aquires the lock that Thread A will be waiting for. 2) Unluckyly, ParentQueue.getQueueUserAclInfo iterates over children queue acls and it wants to run a synchonized method, LeafQueue.getQueueUserAclInfo, but it does not have a lock on LeafQueue.class (a lockid 0x00000000c18d8870=LeafQueue.class). This lock is already held by LeafQueue.completedContainer in Thread A. The order that causes the deadlock: B0 -> A1 -> B2 -> A3. *Java Stacktrace* {code} Found one Java-level deadlock: ============================= "1956747953@qtp-109760451-1959": waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue), which is held by "IPC Server handler 39 on 8032" "IPC Server handler 39 on 8032": waiting to lock monitor 0x00000000422bbc58 (object 0x00000000c18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue), which is held by "ResourceManager Event Processor" "ResourceManager Event Processor": waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue), which is held by "IPC Server handler 39 on 8032" Java stack information for the threads listed above: =================================================== "1956747953@qtp-109760451-1959": at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276) - waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49) at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) "IPC Server handler 39 on 8032": at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueUserAclInfo(LeafQueue.java:544) - waiting to lock <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueUserAclInfo(ParentQueue.java:351) - locked <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueUserAclInfo(CapacityScheduler.java:622) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:517) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:225) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:255) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047) "ResourceManager Event Processor": at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:693) - waiting to lock <0x00000000c1846350> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1460) - locked <0x00000000c18d8870> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:838) - locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:648) - locked <0x00000000c1846310> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:662) Found 1 deadlock. {code} -- This message was sent by Atlassian JIRA (v6.1#6144)