Jon Bender created YARN-10221:
---------------------------------
Summary: Nodemanager lockups on printEventQueueDetails
Key: YARN-10221
URL: https://issues.apache.org/jira/browse/YARN-10221
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.2.1
Environment: We're running stock hadoop3.2.1 with cgroups /
LinuxContainerExecutor.
Java version:
{noformat}
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
{noformat}
Reporter: Jon Bender
We are seeing a rare but critical bug on our production clusters running
Hadoop 3.2.1. The central issue is that the NodeManager locks up while trying
to print details about the event queues, a feature that was added in YARN-8995.
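For context, the stack trace further down shows printEventQueueDetails calling a stream collect over a LinkedBlockingQueue. The following is only a minimal sketch of that pattern, not the Hadoop source: the class name, the EventType enum and the countByType helper are hypothetical stand-ins for illustration.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

class EventQueueDetailsSketch {
    // Hypothetical stand-in for the YARN event types held in the queue.
    enum EventType { INIT_CONTAINER, KILL_CONTAINER, RESOURCE_LOCALIZED }

    // Counts queued events per type by streaming over the live queue.
    // The traversal goes through the queue's spliterator, which is the
    // frame (LBQSpliterator.forEachRemaining) at the top of the stuck thread.
    static Map<EventType, Long> countByType(BlockingQueue<EventType> queue) {
        return queue.stream()
            .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
    }

    public static void main(String[] args) {
        BlockingQueue<EventType> queue = new LinkedBlockingQueue<>();
        queue.add(EventType.INIT_CONTAINER);
        queue.add(EventType.INIT_CONTAINER);
        queue.add(EventType.KILL_CONTAINER);
        // Prints a per-type count of the queued events.
        System.out.println(countByType(queue));
    }
}
```

In this sketch the queue is not modified while it is streamed. In the NodeManager, producers and the dispatcher thread use the same queue concurrently.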
The main symptoms are:
- Containers stuck in an Initing phase (ContainersIniting in jmx)
- NM stops accepting RPC calls
Failed job submissions manifest as socket timeouts to the RPC port:
{code}
INFO - diagnostics: Application application_1585693823779_0028 failed 1 times
(global limit =2; local limit is =1) due to Error launching
appattempt_1585693823779_0028_000001. Got exception:
java.net.SocketTimeoutException: Call From
hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to
hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout
exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892
remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For
more details see: http://wiki.apache.org/hadoop/SocketTimeout
{code}
Relevant output from {{jstack -l}} on an affected NodeManager follows. All IPC
threads are blocked waiting on the lock on the eventQueue.
Thread printing event queue details (this runs indefinitely):
{code:java}
"Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
        - locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
        - locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)
   Locked ownable synchronizers:
        - <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
{code}
Sample IPC handler thread (8039 is our NM RPC port). All IPC handler threads
are waiting on 0x00007f48f5a7a9a8, which the "Public Localizer" thread holds:
{code:java}
"IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
        at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
        at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
   Locked ownable synchronizers:
        - None
{code}
Single thread waiting on 0x00007f48f5a7a950, also held by "Public Localizer":
{code:java}
"NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
        at java.lang.Thread.run(Thread.java:748)
   Locked ownable synchronizers:
        - None
{code}
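The owner/waiter relationships read off the thread dumps above by hand (which thread parks on which ReentrantLock, and who holds it) can also be listed programmatically through the standard java.lang.management API. The following is a rough sketch under that assumption: the LockWaitReport class and waitersWithOwners helper are hypothetical names, and the main method builds its own small lock contention rather than touching a NodeManager.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class LockWaitReport {
    // Returns one line per thread in this JVM that is waiting on a lock
    // whose owning thread is known, similar to what jstack -l lets you infer.
    static List<String> waitersWithOwners() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        // (true, true): include locked monitors and ownable synchronizers.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            if (info.getLockInfo() != null && info.getLockOwnerName() != null) {
                lines.add(info.getThreadName() + " waits for " + info.getLockInfo()
                    + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock();
        lock.lock(); // main thread becomes the owner
        try {
            Thread waiter = new Thread(() -> { lock.lock(); lock.unlock(); }, "demo-waiter");
            waiter.setDaemon(true);
            waiter.start();
            // Spin until the waiter has parked on the lock.
            while (waiter.getState() != Thread.State.WAITING) {
                Thread.yield();
            }
            waitersWithOwners().forEach(System.out::println);
        } finally {
            lock.unlock();
        }
    }
}
```

This sketch only inspects the JVM it runs in. Applying it to a running NodeManager would need a remote JMX connection, which is not shown here.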