[
https://issues.apache.org/jira/browse/YARN-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
徐鹏 updated YARN-9769:
---------------------
Attachment: image-2019-08-21-16-33-55-739.png
> if "ContainerLocalizer Downloader" thread block ,it will never stop
> -------------------------------------------------------------------
>
> Key: YARN-9769
> URL: https://issues.apache.org/jira/browse/YARN-9769
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.5.0
> Environment: hadoop:2.5.0-cdh5.2.0
> Reporter: 徐鹏
> Priority: Major
> Attachments: image-2019-08-21-16-31-23-444.png,
> image-2019-08-21-16-31-58-374.png, image-2019-08-21-16-32-07-920.png,
> image-2019-08-21-16-32-39-938.png, image-2019-08-21-16-33-07-159.png,
> image-2019-08-21-16-33-55-739.png, nm_jstack
>
>
> If a "ContainerLocalizer Downloader" thread blocks, it never stops, and the
> NodeManager JVM eventually runs out of memory. The NodeManager should fail a
> blocked "ContainerLocalizer Downloader" thread after a timeout.
>
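A minimal sketch of the "fail by timeout" idea, using only java.util.concurrent. The class and method names (DownloadTimeoutSketch, getWithTimeout) are hypothetical and are not taken from the NodeManager code; the sketch only shows how a caller can stop waiting on a stuck download task. Note that the thread in the stack trace is parked in awaitUninterruptibly, so the interrupt sent by cancel(true) would probably not release it; the timeout bounds how long the caller waits, not how long the stuck thread lives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DownloadTimeoutSketch {

    /**
     * Waits for the task result for at most the given time.
     * On timeout the task is cancelled (with interrupt) and the
     * TimeoutException is rethrown so the caller can fail the download.
     */
    static <T> T getWithTimeout(Future<T> pending, long timeout, TimeUnit unit)
            throws Exception {
        try {
            return pending.get(timeout, unit);
        } catch (TimeoutException e) {
            pending.cancel(true); // best effort: interrupts the worker thread
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch never = new CountDownLatch(1); // never counted down

        // Stand-in for a download that blocks indefinitely.
        Future<String> stuck = pool.submit(() -> {
            never.await();
            return "done";
        });

        try {
            getWithTimeout(stuck, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("download timed out, cancelled=" + stuck.isCancelled());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The timeout value here is arbitrary; a real value would presumably come from configuration.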
> In my case:
> *NM JVM main options*:
> -XX:InitialHeapSize=2147483648 -XX:MaxGCPauseMillis=200
> -XX:MaxHeapSize=2147483648 -XX:MaxNewSize=1287651328
> -XX:MinHeapDeltaBytes=1048576 -XX:+UseG1GC
> *GC*: runs frequently but is ineffective (old gen >= 99%)
>
> !image-2019-08-21-16-31-23-444.png!
> *jstack & jmap*: 3602 blocked "ContainerLocalizer Downloader" threads,
> 561 MB in total
>
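The report does not say how the thread count was obtained. One way to get such a number from a saved jstack dump is sketched below; the class name JstackThreadCount and the helper countThreads are hypothetical, and the sketch assumes the usual jstack layout in which a quoted thread header line is followed by a "java.lang.Thread.State:" line.

```java
import java.util.Arrays;
import java.util.List;

public class JstackThreadCount {

    /**
     * Counts threads whose header line starts with the quoted thread name
     * and whose state line is exactly "java.lang.Thread.State: <state>".
     */
    static int countThreads(List<String> dumpLines, String threadName, String state) {
        String header = "\"" + threadName + "\"";
        String stateLine = "java.lang.Thread.State: " + state;
        int count = 0;
        boolean inTarget = false;
        for (String line : dumpLines) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                // A new thread entry begins; remember whether it is a target.
                inTarget = t.startsWith(header);
            } else if (inTarget && t.startsWith("java.lang.Thread.State:")) {
                if (t.equals(stateLine)) {
                    count++;
                }
                inTarget = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Tiny made-up sample; a real dump would be read from a file.
        List<String> sample = Arrays.asList(
            "\"ContainerLocalizer Downloader\" #1 prio=5 waiting on condition [0x1]",
            "   java.lang.Thread.State: WAITING (parking)",
            "\tat sun.misc.Unsafe.park(Native Method)",
            "",
            "\"main\" #2 prio=5 runnable [0x2]",
            "   java.lang.Thread.State: RUNNABLE");
        System.out.println(
            countThreads(sample, "ContainerLocalizer Downloader", "WAITING (parking)"));
    }
}
```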
> {code:java}
> "ContainerLocalizer Downloader" #59288379 prio=5 os_prio=0 tid=0x00007f9c62d9d800 nid=0xb7550 waiting on condition [0x00007f9b1c2c0000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x000000008057ddb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>     at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:254)
>     at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:432)
>     at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:449)
>     at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
>     at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:394)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:305)
>     at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:590)
>     - locked <0x00000000fa4ce540> (a org.apache.hadoop.hdfs.DFSInputStream)
>     at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
>     - locked <0x00000000fa4ce540> (a org.apache.hadoop.hdfs.DFSInputStream)
>     at java.io.DataInputStream.read(DataInputStream.java:100)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
>     at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:264)
>     at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:60)
>     at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:356)
>     at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:354)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1701)
>     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:354)
>     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
>
> !image-2019-08-21-16-31-58-374.png!
> !image-2019-08-21-16-32-07-920.png!
> *ContainerLocalizer.class*
>
> !image-2019-08-21-16-33-07-159.png!
> *Add a loop termination condition*
> *!image-2019-08-21-16-33-55-739.png!*
> [^nm_jstack]