[ https://issues.apache.org/jira/browse/YARN-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294988#comment-17294988 ]
zhengchenyu commented on YARN-10221:
------------------------------------

Please see YARN-10642, which explains the cause and fixes this problem.

> Nodemanager lockups on printEventQueueDetails
> ---------------------------------------------
>
>                 Key: YARN-10221
>                 URL: https://issues.apache.org/jira/browse/YARN-10221
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>         Environment: We're running stock hadoop3.2.1 with cgroups / LinuxContainerExecutor.
> Java version:
> {noformat}
> openjdk version "1.8.0_242"
> OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
> OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
> {noformat}
>            Reporter: Jon Bender
>            Assignee: Qi Zhu
>            Priority: Major
>
> We are seeing a rare but critical bug on our production clusters running Hadoop 3.2.1. The central issue is that the NodeManager locks up while trying to print details about the event queues. This feature was added in YARN-8995.
> The main symptoms are:
> - Containers stuck in an Initing phase (ContainersIniting in JMX)
> - NM stops accepting RPC calls
> Failed job submissions manifest as socket timeouts to the RPC port:
> {code}
> INFO - diagnostics: Application application_1585693823779_0028 failed 1 times (global limit =2; local limit is =1) due to Error launching appattempt_1585693823779_0028_000001. Got exception: java.net.SocketTimeoutException: Call From hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892 remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
> {code}
> Relevant outputs from {{jstack -l}} on an affected NodeManager.
> All IPC threads are blocked waiting on the lock on the eventQueue.
>
> Thread printing event queue details - this runs indefinitely
> {code:java}
> "Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
>         at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>         at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>         at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>         at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>         at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         - locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
>         - locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)
>
>    Locked ownable synchronizers:
>         - <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         - <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         - <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> {code}
> Sample IPC handler thread (8039 is our NM RPC port).
> All IPC handler threads are waiting on 0x00007f48f5a7a9a8
> {code:java}
> "IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>         at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
>         at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
>
>    Locked ownable synchronizers:
>         - None
> {code}
>
> Single thread waiting on 0x00007f48f5a7a950
> {code:java}
> "NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
>         at java.lang.Thread.run(Thread.java:748)
>
>    Locked ownable synchronizers:
>         - None
> {code}
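
For anyone reading a dump like the one quoted above: the "parking to wait for" / "Locked ownable synchronizers" relationship that {{jstack -l}} prints can also be queried from inside a JVM through the standard java.lang.management API. The following is only a minimal sketch of that idea, assuming a HotSpot JVM. The class LockWaiters, its methods waitersOn and demo, and the thread name "demo-waiter" are hypothetical names made up for illustration; none of this is Hadoop code or part of the YARN-10642 patch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper for illustration only; not part of Hadoop.
final class LockWaiters {

    /** Returns the names of threads blocked on a lock owned by the given thread id. */
    static List<String> waitersOn(long ownerId) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> waiters = new ArrayList<>();
        // We only need each thread's blocker and its owner, so the lists of
        // held monitors and held synchronizers are not requested.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getLockOwnerId() == ownerId) {
                waiters.add(info.getThreadName());
            }
        }
        return waiters;
    }

    /** Holds a ReentrantLock in the calling thread while a second thread parks on it. */
    static List<String> demo() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock(); // the caller stands in for the thread that never releases the lock
        Thread waiter = new Thread(() -> {
            lock.lock();
            lock.unlock();
        }, "demo-waiter");
        waiter.setDaemon(true);
        try {
            waiter.start();
            // Poll until the waiter has parked inside lock().
            while (waiter.getState() != Thread.State.WAITING) {
                Thread.sleep(10);
            }
            return waitersOn(Thread.currentThread().getId());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            lock.unlock(); // lets the waiter finish
        }
    }

    public static void main(String[] args) {
        System.out.println("threads waiting on a lock held by main: " + demo());
    }
}
```

In the dump quoted above, the same query against the "Public Localizer" thread would be expected to list the IPC handlers and the "NM ContainerManager dispatcher", since both ReentrantLock addresses they are parked on appear under that thread's locked ownable synchronizers.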