[ 
https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kawa updated YARN-1422:
----------------------------

    Priority: Critical  (was: Major)

> RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a 
> container is completing
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1422
>                 URL: https://issues.apache.org/jira/browse/YARN-1422
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Adam Kawa
>            Priority: Critical
>
> If getQueueUserAclInfo() on a parent/root queue (e.g. via 
> CapacityScheduler.getQueueUserAclInfo) is called, and a container is 
> completing, then the ResourceManager can deadlock. 
> It is similar to https://issues.apache.org/jira/browse/YARN-325. 
> *More details:*
> * Thread A
> 1) In a synchronized block of code (a lockid 
> 0x00000000c18d8870=LeafQueue.class), LeafQueue.completedContainer wants to 
> inform the parent queue that a container is being completed and invokes 
> the ParentQueue.completedContainer method.
> 3) ParentQueue.completedContainer waits to acquire a lock on itself (a 
> lockid 0x00000000c1846350=ParentQueue.class) to enter its synchronized block 
> of code. It cannot acquire this lock, because Thread B already holds it.
> * Thread B
> 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This 
> method invokes a synchronized method on ParentQueue.class, i.e. 
> ParentQueue.getQueueUserAclInfo (a lockid 
> 0x00000000c1846350=ParentQueue.class), and acquires the lock that Thread A 
> will be waiting for. 
> 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the child queues' 
> ACLs and wants to run a synchronized method, LeafQueue.getQueueUserAclInfo, 
> but it does not hold the lock on LeafQueue.class (a lockid 
> 0x00000000c18d8870=LeafQueue.class). This lock is already held by 
> LeafQueue.completedContainer in Thread A.
> The order that causes the deadlock: B0 -> A1 -> B2 -> A3.
> *Java Stacktrace*
> {code}
> Found one Java-level deadlock:
> =============================
> "1956747953@qtp-109760451-1959":
>   waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
>   which is held by "IPC Server handler 39 on 8032"
> "IPC Server handler 39 on 8032":
>   waiting to lock monitor 0x00000000422bbc58 (object 0x00000000c18d8870, a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
>   which is held by "ResourceManager Event Processor"
> "ResourceManager Event Processor":
>   waiting to lock monitor 0x00000000434e10c8 (object 0x00000000c1846350, a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
>   which is held by "IPC Server handler 39 on 8032"
> Java stack information for the threads listed above:
> ===================================================
> "1956747953@qtp-109760451-1959":
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
>       - waiting to lock <0x00000000c1846350> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.<init>(CapacitySchedulerInfo.java:49)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
>       at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
>       at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
>       at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>       at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>       at 
> org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
>       at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
>       at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
>       at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>       at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
>       at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>       at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
>       at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
>       at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>       at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
>       at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>       at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>       at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>       at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>       at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>       at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>       at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>       at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>       at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>       at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>       at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>       at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>       at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>       at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>       at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>       at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>       at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>       at org.mortbay.jetty.Server.handle(Server.java:326)
>       at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>       at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>       at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>       at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> "IPC Server handler 39 on 8032":
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueUserAclInfo(LeafQueue.java:544)
>       - waiting to lock <0x00000000c18d8870> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueUserAclInfo(ParentQueue.java:351)
>       - locked <0x00000000c1846350> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueUserAclInfo(CapacityScheduler.java:622)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:517)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueUserAcls(ApplicationClientProtocolPBServiceImpl.java:225)
>       at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:255)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
> "ResourceManager Event Processor":
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:693)
>       - waiting to lock <0x00000000c1846350> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1460)
>       - locked <0x00000000c18d8870> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:838)
>       - locked <0x00000000c1846310> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:648)
>       - locked <0x00000000c1846310> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>       at java.lang.Thread.run(Thread.java:662)
> Found 1 deadlock.
> {code}
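
The interleaving B0 -> A1 -> B2 -> A3 described above can be reproduced in isolation. The following is only a minimal sketch, not the CapacityScheduler code: the class LockOrderSketch, the PARENT and LEAF locks, and the holdThenProbe helper are hypothetical stand-ins for the ParentQueue and LeafQueue monitors. It assumes ReentrantLock in place of synchronized so that the second acquisition can be probed with tryLock() instead of blocking forever.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    // Hypothetical stand-ins for the ParentQueue and LeafQueue monitors.
    static final ReentrantLock PARENT = new ReentrantLock();
    static final ReentrantLock LEAF = new ReentrantLock();

    // Takes `first`, waits until the other thread holds its own first lock,
    // then probes `second` without blocking. Returns whether the probe succeeded.
    static boolean holdThenProbe(ReentrantLock first, ReentrantLock second,
                                 CyclicBarrier barrier) {
        first.lock();
        try {
            barrier.await();                 // both threads now hold their first lock
            boolean acquired = second.tryLock();
            if (acquired) {
                second.unlock();
            }
            barrier.await();                 // keep `first` held until both probes are done
            return acquired;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    // Runs both threads and reports {A got PARENT, B got LEAF}.
    static boolean[] run() throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(2);
        boolean[] acquired = new boolean[2];
        // Thread A: leaf first, then parent (like LeafQueue.completedContainer).
        Thread a = new Thread(
            () -> acquired[0] = holdThenProbe(LEAF, PARENT, barrier), "Thread A");
        // Thread B: parent first, then leaf (like ParentQueue.getQueueUserAclInfo).
        Thread b = new Thread(
            () -> acquired[1] = holdThenProbe(PARENT, LEAF, barrier), "Thread B");
        a.start();
        b.start();
        a.join();
        b.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("A got parent lock: " + r[0] + ", B got leaf lock: " + r[1]);
    }
}
```

Neither probe succeeds, because each thread holds the lock the other one needs. With synchronized methods, as in the stack trace above, the same ordering blocks both threads indefinitely rather than returning false.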



--
This message was sent by Atlassian JIRA
(v6.1#6144)
