[
https://issues.apache.org/jira/browse/YARN-11501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736986#comment-17736986
]
Bibin Chundatt edited comment on YARN-11501 at 6/26/23 5:10 AM:
----------------------------------------------------------------
[~prabhujoseph] Did a quick scan at the call stack..
at
org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)
I am not able to trace ClusterNodeTracker#updateMaxResources ->
RMNodeImpl.getState .. in trunk code . Any private change ??
was (Author: bibinchundatt):
[~prabhujoseph] Did a quick scan at the call stack.. Dont stack tracematching
with one from OSS
at
org.apache.hadoop.yarn.server.resourcemanager.rmnode.*RMNodeImpl.getState(RMNodeImpl.java:619)*
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)
I am not able to trace ClusterNodeTracker#updateMaxResources ->
RMNodeImpl.getState .. in trunk code . Any private change ??
> ResourceManager deadlock due to
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers
> ------------------------------------------------------------------------------------------
>
> Key: YARN-11501
> URL: https://issues.apache.org/jira/browse/YARN-11501
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Critical
>
> We have seen a deadlock in ResourceManager due to
> StatusUpdateWhenHealthyTransition.hasScheduledAMContainers holding the lock
> on RMNode and waiting to lock SchedulerNode whereas
> CapacityScheduler#removeNode taken lock on SchedulerNode and waiting to lock
> RMNode.
> cc *Vishal Vyas*
>
> {code:java}
> Found one Java-level deadlock:
> =============================
> "qtp1401737458-850":
> waiting for ownable synchronizer 0x0000000717e6ff60, (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
> which is held by "RM Event dispatcher"
> "RM Event dispatcher":
> waiting for ownable synchronizer 0x00000007168a7a38, (a
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
> which is held by "SchedulerEventDispatcher:Event Processor"
> "SchedulerEventDispatcher:Event Processor":
> waiting for ownable synchronizer 0x0000000717e6ff60, (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
> which is held by "RM Event dispatcher"
> Java stack information for the threads listed above:
> ===================================================
> "qtp1401737458-850":
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000717e6ff60> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
> at
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
> at
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.queryRMNodes(RMServerUtils.java:128)
> at
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:464)
> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
> at
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
> at
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
> at
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
> at
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
> at
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:927)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
> at
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:180)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
> at
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
> at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
> at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
> at
> com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
> - locked <0x0000000791ad1fd0> (a
> com.google.inject.servlet.GuiceFilter$Context)
> at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
> at
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
> at
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
> at
> org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
> at
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
> at
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
> at
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
> at
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
> at
> org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:98)
> at
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
> at
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
> at
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1764)
> at
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
> at
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at
> org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
> at
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> at org.eclipse.jetty.server.Server.handle(Server.java:516)
> at
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
> at
> org.eclipse.jetty.server.HttpChannel$$Lambda$114/1946743342.dispatch(Unknown
> Source)
> at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
> at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
> at
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
> at java.lang.Thread.run(Thread.java:750)
> "RM Event dispatcher":
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000007168a7a38> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
> at
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNode(ClusterNodeTracker.java:135)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getSchedulerNode(AbstractYarnScheduler.java:792)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.hasScheduledAMContainers(RMNodeImpl.java:1427)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1372)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1342)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x0000000717e70928> (a
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:768)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:104)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1267)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1251)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:241)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:156)
> at java.lang.Thread.run(Thread.java:750)
> "SchedulerEventDispatcher:Event Processor":
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000717e6ff60> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
> at
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.getState(RMNodeImpl.java:619)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:307)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.lambda$updateMaxResources$0(ClusterNodeTracker.java:337)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker$$Lambda$131/1663466005.accept(Unknown
> Source)
> at java.util.HashMap$Values.forEach(HashMap.java:982)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.updateMaxResources(ClusterNodeTracker.java:337)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:220)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
> at java.lang.Thread.run(Thread.java:750)
> Found 1 deadlock. {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]