[ 
https://issues.apache.org/jira/browse/YARN-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294988#comment-17294988
 ] 

zhengchenyu commented on YARN-10221:
------------------------------------

Please follow YARN-10642, which explains the cause and fixes this problem.

> Nodemanager lockups on printEventQueueDetails
> ---------------------------------------------
>
>                 Key: YARN-10221
>                 URL: https://issues.apache.org/jira/browse/YARN-10221
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>         Environment: We're running stock hadoop3.2.1 with cgroups / 
> LinuxContainerExecutor.
> Java version:
> {noformat}
> openjdk version "1.8.0_242"
> OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
> OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode) {noformat}
>  
>            Reporter: Jon Bender
>            Assignee: Qi Zhu
>            Priority: Major
>
> We are seeing a rare but critical bug on our production clusters running 
> hadoop 3.2.1. The central issue is that the NodeManager locks up while trying 
> to print details about the event queues. This feature was added in YARN-8995.
> The main symptoms are:
> - Containers stuck in an Initing phase (ContainersIniting in jmx)
> - NM stops accepting RPC calls
> Failed job submissions manifest as socket timeouts to the RPC port:
> {code}
> INFO - diagnostics: Application application_1585693823779_0028 failed 1 times 
> (global limit =2; local limit is =1) due to Error launching 
> appattempt_1585693823779_0028_000001. Got exception: 
> java.net.SocketTimeoutException: Call From 
> hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to 
> hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout 
> exception: java.net.SocketTimeoutException: 60000 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892 
> remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For 
> more details see:  http://wiki.apache.org/hadoop/SocketTimeout
> {code}
> Relevant output from {{jstack -l}} on an affected NodeManager. All IPC 
> threads are blocked waiting on the eventQueue lock.
> Thread printing event queue details (this runs indefinitely):
> {code:java}
> "Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
>    java.lang.Thread.State: RUNNABLE
>       at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
>       at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>       at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>       at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>       at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>       at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
>       at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>       at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       - locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
>       - locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)
>    Locked ownable synchronizers:
>       - <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>       - <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>       - <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> {code}
> Sample IPC handler thread (8039 is our NM RPC port). All IPC handler threads 
> are waiting on 0x00007f48f5a7a9a8:
> {code:java}
> "IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
>       at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>       at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
>       at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
>       at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
>    Locked ownable synchronizers:
>       - None
> {code}
>  
> Single thread waiting on 0x00007f48f5a7a950:
> {code:java}
> "NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
>       at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - None
> {code}
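
For readers following the traces: the "Public Localizer" thread is inside a stream collect over the LinkedBlockingQueue, called from printEventQueueDetails within handle(), and its "Locked ownable synchronizers" list includes both 0x00007f48f5a7a950 and 0x00007f48f5a7a9a8, the locks that the take() and put() callers are parked on. The sketch below illustrates that scanning pattern next to one possible alternative that keeps per-type counters so the live queue is never walked. It is only an illustration under stated assumptions: it is not the Hadoop code and not necessarily the fix made in YARN-10642, and EventQueueCounts, EventType, countByScanning and CountingQueue are hypothetical names.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names come from Hadoop.
class EventQueueCounts {

    // Stand-in for the event types carried by a dispatcher queue.
    enum EventType { INIT_CONTAINER, KILL_CONTAINER, RESOURCE_LOCALIZED }

    // Scanning pattern: walk the live queue with a stream and count per type.
    // The traversal has to coordinate with the queue's internal locks, so
    // producers and consumers can be delayed while it runs.
    static Map<EventType, Long> countByScanning(Queue<EventType> queue) {
        return queue.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
    }

    // Alternative (an assumption, not the upstream fix): keep one counter per
    // type, updated on enqueue and dequeue. Counts are approximate under
    // concurrency, which is usually acceptable for diagnostic logging.
    static final class CountingQueue {
        private final LinkedBlockingQueue<EventType> queue = new LinkedBlockingQueue<>();
        private final Map<EventType, LongAdder> counts = new ConcurrentHashMap<>();

        void add(EventType e) {
            // Increment first so poll() always finds an existing counter.
            counts.computeIfAbsent(e, k -> new LongAdder()).increment();
            queue.add(e); // unbounded queue: add() does not fail for capacity
        }

        EventType poll() {
            EventType e = queue.poll();
            if (e != null) {
                counts.get(e).decrement();
            }
            return e;
        }

        long count(EventType type) {
            LongAdder adder = counts.get(type);
            return adder == null ? 0L : adder.sum();
        }
    }

    public static void main(String[] args) {
        CountingQueue q = new CountingQueue();
        q.add(EventType.KILL_CONTAINER);
        q.add(EventType.KILL_CONTAINER);
        q.add(EventType.INIT_CONTAINER);
        q.poll(); // removes the first KILL_CONTAINER
        System.out.println("KILL_CONTAINER pending: " + q.count(EventType.KILL_CONTAINER));
    }
}
```

Reading a counter never touches the queue's locks, so a diagnostic print of this kind cannot hold up put() or take() callers the way a full scan can.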



