[
https://issues.apache.org/jira/browse/YARN-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Qi Zhu reassigned YARN-10221:
-----------------------------
Assignee: Qi Zhu
> Nodemanager lockups on printEventQueueDetails
> ---------------------------------------------
>
> Key: YARN-10221
> URL: https://issues.apache.org/jira/browse/YARN-10221
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.2.1
> Environment: We're running stock Hadoop 3.2.1 with cgroups /
> LinuxContainerExecutor.
> Java version:
> {noformat}
> openjdk version "1.8.0_242"
> OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
> OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
> {noformat}
>
> Reporter: Jon Bender
> Assignee: Qi Zhu
> Priority: Major
>
> We are seeing a rare but critical bug on our production clusters running
> Hadoop 3.2.1. The central issue is that the NodeManager locks up while trying
> to print details about the event queues. This feature was added in YARN-8995.
> The main symptoms are:
> - Containers stuck in an Initing phase (ContainersIniting in jmx)
> - NM stops accepting RPC calls
> Failed job submissions manifest as socket timeouts to the RPC port:
> {code}
> INFO - diagnostics: Application application_1585693823779_0028 failed 1 times
> (global limit =2; local limit is =1) due to Error launching
> appattempt_1585693823779_0028_000001. Got exception:
> java.net.SocketTimeoutException: Call From
> hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to
> hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout
> exception: java.net.SocketTimeoutException: 60000 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892
> remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For
> more details see: http://wiki.apache.org/hadoop/SocketTimeout
> {code}
> Relevant outputs from {{jstack -l}} on an affected NodeManager. All IPC
> threads are blocked waiting on the lock on the eventQueue.
> Thread printing event queue details (this runs indefinitely):
> {code:java}
> "Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
> java.lang.Thread.State: RUNNABLE
> at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> - locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
> - locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)
> Locked ownable synchronizers:
> - <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> - <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> - <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> {code}
> Sample IPC handler thread (8039 is our NM RPC port). All IPC handler threads
> are waiting on 0x00007f48f5a7a9a8:
> {code:java}
> "IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
> at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
> at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
> at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> Locked ownable synchronizers:
> - None
> {code}
>
> Single thread waiting on 0x00007f48f5a7a950:
> {code:java}
> "NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
> at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - None
> {code}
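
For readers following the traces above: the "Public Localizer" thread reaches LBQSpliterator.forEachRemaining through ReferencePipeline.collect, which suggests that printEventQueueDetails streams directly over the live LinkedBlockingQueue. The sketch below is not the Hadoop code. It is a minimal illustration of that pattern next to one possible alternative that copies the queue with toArray() before grouping. The class name, the method names, and the use of String in place of real event objects are all hypothetical, and the report does not say whether the eventual fix takes this form.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Hadoop.
final class EventQueueSnapshot {

  // Pattern suggested by the stack trace: stream over the live queue,
  // so the traversal runs while other threads may put() and take().
  static Map<String, Long> countLive(BlockingQueue<String> queue) {
    return queue.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  // Assumed alternative: copy the current contents into an array with a
  // single toArray() call, then group the private copy.
  static Map<String, Long> countSnapshot(BlockingQueue<String> queue) {
    String[] snapshot = queue.toArray(new String[0]);
    return Arrays.stream(snapshot)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  public static void main(String[] args) {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    queue.add("KILL_CONTAINER");
    queue.add("RESOURCE_LOCALIZED");
    queue.add("KILL_CONTAINER");
    System.out.println(countSnapshot(queue));
  }
}
```

With no concurrent producers or consumers, both methods return the same counts. They differ only in whether the grouping walks the shared queue or a private copy.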