徐鹏 created YARN-9769:
------------------------
Summary: If a "ContainerLocalizer Downloader" thread blocks, it
never stops
Key: YARN-9769
URL: https://issues.apache.org/jira/browse/YARN-9769
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.5.0
Environment: hadoop:2.5.0-cdh5.2.0
Reporter: 徐鹏
Attachments: nm_jstack
If a "ContainerLocalizer Downloader" thread blocks, it never stops, and
blocked threads accumulate until the NodeManager JVM runs out of memory.
The NodeManager should fail a blocked "ContainerLocalizer Downloader"
thread after a timeout.
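One possible shape for the timeout requested above, as a minimal sketch rather than the NodeManager's actual code: the caller waits on the download's Future for a bounded time and gives up when the limit passes. The class name DownloadTimeoutSketch, the helper callWithTimeout, and the timeout values are hypothetical. Note that Future.cancel(true) only requests an interrupt; a thread parked in awaitUninterruptibly(), as in the stack trace in this report, would not react to it, so a real fix would also have to address the underlying wait.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DownloadTimeoutSketch {

    /**
     * Hypothetical helper: submits the task and waits at most timeoutMs.
     * Returns the task's result, or null if the wait timed out.
     */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMs)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting and request an interrupt of the worker thread.
            // This does NOT unblock a thread parked in an uninterruptible wait.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // A task that finishes well within the limit.
            String fast = callWithTimeout(pool, () -> "done", 1000);
            // A task standing in for a download that never makes progress.
            String slow = callWithTimeout(pool, () -> {
                Thread.sleep(60_000);
                return "late";
            }, 100);
            System.out.println("fast=" + fast + ", slow=" + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a bounded wait, the caller can at least mark the resource as failed instead of waiting forever, even if the worker thread itself stays parked.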
In my case:
*NM JVM main options*:
-XX:InitialHeapSize=2147483648 -XX:MaxGCPauseMillis=200
-XX:MaxHeapSize=2147483648 -XX:MaxNewSize=1287651328
-XX:MinHeapDeltaBytes=1048576 -XX:+UseG1GC
*GC*: runs frequently but is ineffective (old gen stays >= 99% full)
!image-2019-08-20-23-39-23-968.png!
*jstack & jmap*: 3602 "ContainerLocalizer Downloader" threads are blocked,
accounting for 561 MB in total
{code:java}
"ContainerLocalizer Downloader" #59288379 prio=5 os_prio=0 tid=0x00007f9c62d9d800 nid=0xb7550 waiting on condition [0x00007f9b1c2c0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000008057ddb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
        at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:254)
        at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:432)
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
        at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:449)
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:394)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:305)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:590)
        - locked <0x00000000fa4ce540> (a org.apache.hadoop.hdfs.DFSInputStream)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
        - locked <0x00000000fa4ce540> (a org.apache.hadoop.hdfs.DFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:264)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:60)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:356)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:354)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1701)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:354)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
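A count like the one reported for the blocked downloader threads can be reproduced from a saved jstack dump by counting thread header lines with that name. The following is a minimal sketch; the class name ThreadDumpCount and the method countThreads are hypothetical, and it assumes the usual jstack layout in which each thread entry starts with its quoted name.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ThreadDumpCount {

    /** Counts jstack thread header lines whose quoted name equals threadName. */
    static long countThreads(List<String> dumpLines, String threadName) {
        // A header line looks like: "name" #id prio=5 os_prio=0 tid=... nid=...
        String prefix = "\"" + threadName + "\" ";
        return dumpLines.stream().filter(line -> line.startsWith(prefix)).count();
    }

    public static void main(String[] args) throws Exception {
        // Read the dump file given as the first argument, or fall back to a tiny sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList(
                        "\"ContainerLocalizer Downloader\" #1 prio=5 os_prio=0",
                        "   java.lang.Thread.State: WAITING (parking)",
                        "\"main\" #2 prio=5 os_prio=0",
                        "\"ContainerLocalizer Downloader\" #3 prio=5 os_prio=0");
        System.out.println(countThreads(lines, "ContainerLocalizer Downloader"));
    }
}
```

Run with the path of the attached dump as the only argument to count the matching threads in it.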
!image-2019-08-21-15-18-09-514.png!
!image-2019-08-21-15-18-27-553.png!
*ContainerLocalizer.class*
!image-2019-08-21-16-21-01-610.png!
*Add loop termination*
!image-2019-08-21-16-21-23-037.png!
[^nm_jstack]
--
This message was sent by Atlassian Jira
(v8.3.2#803003)