[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-10642:
-------------------------------
    Description: 
In our cluster, the ResourceManager got stuck twice within twenty days, and 
YARN clients could not submit applications. I captured jstack output the 
second time and found the reason.
Analyzing all the jstack output, I found many threads blocked because they 
could not acquire LinkedBlockingQueue's putLock. (Note: for brevity, I omit 
the detailed analysis.)
The root cause is one thread holding the putLock indefinitely: 
printEventQueueDetails calls forEachRemaining, which acquires both the 
putLock and the takeLock via fullyLock(). The AsyncDispatcher then gets stuck.

{code}
Thread 6526 (IPC Server handler 454 on default port 8030):
  State: RUNNABLE
  Blocked count: 29988
  Waited count: 2035029
  Stack:
    
java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
    java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
    
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
    
org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
    
org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
    
org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
    
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
    
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
    org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
    org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
    java.security.AccessController.doPrivileged(Native Method)
{code}


I analyzed LinkedBlockingQueue's source code and found that forEachRemaining 
in LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and 
take() are called from different threads.
YARN-8995 introduced the printEventQueueDetails method, whose 
"eventQueue.stream().collect" call ends up invoking forEachRemaining.

Why does this happen? "put.png" shows how put("a") works, and "take.png" 
shows how take() works. Note one special detail: a removed Node is made to 
point to itself to help GC!
The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
Node, but it releases the lock each time after reading an item value from a 
Node. If take() is called during that window, the variable 'p' in 
forEachRemaining may end up pointing to a Node that points to itself, and 
forEachRemaining then spins in an endless loop. You can see it in 
"deadloop.png".
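To make the race concrete, here is the relevant JDK 8 logic in abridged 
form (paraphrased from the java.util.concurrent sources; see the actual JDK 
code for full detail):

{code}
// LinkedBlockingQueue.dequeue() (abridged): the old head node is
// self-linked to help GC, and the new head's item is nulled out.
private E dequeue() {
    Node<E> h = head;
    Node<E> first = h.next;
    h.next = h;              // self-link: h.next now points to h itself
    head = first;
    E x = first.item;
    first.item = null;       // the new head is a sentinel with a null item
    return x;
}

// LBQSpliterator.forEachRemaining() (abridged): advances one element at a
// time, unlocking between elements. If 'p' lands on a self-linked node,
// the inner while loop never exits: e stays null and p = p.next == p,
// all while holding BOTH locks.
public void forEachRemaining(Consumer<? super E> action) {
    Node<E> p = current;
    do {
        E e = null;
        q.fullyLock();       // acquires both putLock and takeLock
        try {
            if (p == null)
                p = q.head.next;
            while (p != null) {
                e = p.item;
                p = p.next;  // on a self-linked node this goes nowhere
                if (e != null)
                    break;   // a dequeued node's item is null, so no break
            }
        } finally {
            q.fullyUnlock();
        }
        if (e != null)
            action.accept(e);  // locks are released here; take() can run
    } while (p != null);
}
{code}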

A simple unit test reproduces the problem: make forEachRemaining run more 
slowly than take(), and the endless loop appears. The unit test is 
MockForDeadLoop.java.

Debugging MockForDeadLoop.java, I saw a Node pointing to itself. You can see 
it in "debugfornode.png".



> ResourceManager may get stuck because AsyncDispatcher's 
> printEventQueueDetails method spins in an endless loop
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10642
>                 URL: https://issues.apache.org/jira/browse/YARN-10642
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: zhengchenyu
>            Assignee: zhengchenyu
>            Priority: Critical
>             Fix For: 3.3.1, 3.2.3
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
