Prabhu Joseph created YARN-7426:
-----------------------------------
Summary: Add a finite shell command timeout to ContainerLocalizer
Key: YARN-7426
URL: https://issues.apache.org/jira/browse/YARN-7426
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Priority: Critical
When the NodeManager is overloaded and ContainerLocalizer processes hang, the
containers time out and are cleaned up. The LocalizerRunner thread is
interrupted during cleanup, but the interrupt has no effect while the thread is
blocked reading from a FileInputStream. LocalizerRunner threads and
ContainerLocalizer processes keep accumulating, which makes the node completely
unresponsive. We can add a timeout to the shell command to avoid this, similar
to HADOOP-13817. The timeout value can be set by the AM, the same as the
container timeout.
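To illustrate the idea of a finite shell command timeout, here is a minimal sketch using only the plain JDK. It is not Hadoop's Shell implementation and not the proposed patch; the names TimedShellCommand, Result, and run are hypothetical. The sketch assumes a Unix-like host, drains the child's output on a separate daemon thread, and has the caller wait with a deadline, so the caller never sits in a blocking stream read that ignores interrupts.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Hadoop.
final class TimedShellCommand {

    // Outcome of one bounded command run.
    static final class Result {
        final boolean timedOut;
        final int exitCode;   // -1 when the command was killed on timeout
        final String output;  // combined stdout and stderr

        Result(boolean timedOut, int exitCode, String output) {
            this.timedOut = timedOut;
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    // Runs the command and kills it if it has not exited within timeoutMs.
    static Result run(long timeoutMs, String... command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .start();

        // Drain output on a daemon thread so the caller never blocks in
        // a stream read, which does not respond to Thread.interrupt().
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        Thread drainer = new Thread(() -> {
            try (InputStream in = process.getInputStream()) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    captured.write(buf, 0, n);
                }
            } catch (IOException ignored) {
                // The stream may be closed once the process is destroyed.
            }
        }, "output-drainer");
        drainer.setDaemon(true);
        drainer.start();

        boolean finished;
        try {
            // Unlike a blocking read, this wait is bounded and interruptible.
            finished = process.waitFor(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            process.destroyForcibly(); // do not leak the child on cleanup
            throw e;
        }
        if (!finished) {
            process.destroyForcibly();
            process.waitFor(1, TimeUnit.SECONDS); // bounded wait to reap it
        }
        drainer.join(1000); // bounded: never wait forever on the pipe

        int exitCode = finished ? process.exitValue() : -1;
        String output = new String(captured.toByteArray(), StandardCharsets.UTF_8);
        return new Result(!finished, exitCode, output);
    }

    public static void main(String[] args) throws Exception {
        // "sleep 30" stands in for a hung localizer process.
        Result r = run(500, "sleep", "30");
        System.out.println("timedOut=" + r.timedOut + " exitCode=" + r.exitCode);
    }
}
```

In this sketch the timeout is a plain argument; in the NodeManager it would presumably come from configuration or from the AM, as suggested above.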
ContainerLocalizer JVM stacktrace:
{code}
"main" #1 prio=5 os_prio=0 tid=0x00007fd8ec019000 nid=0xc295 runnable [0x00007fd8f3956000]
java.lang.Thread.State: RUNNABLE
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:219)
at java.util.zip.ZipFile.<init>(ZipFile.java:149)
at java.util.jar.JarFile.<init>(JarFile.java:166)
at java.util.jar.JarFile.<init>(JarFile.java:103)
at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803)
at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
- locked <0x000000076ac75058> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
- locked <0x000000076ac75058> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x000000076ac7f960> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
{code}
NodeManager LocalizerRunner thread which is not interrupted:
{code}
"LocalizerRunner for container_e746_1508665985104_601806_01_000005" #3932753 prio=5 os_prio=0 tid=0x00007fb258d5f800 nid=0x11091 runnable [0x00007fb153946000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0x0000000718502b80> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x0000000718502bd8> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <0x0000000718502bd8> (a java.io.InputStreamReader)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
at org.apache.hadoop.util.Shell.run(Shell.java:848)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
{code}
NM log shows the LocalizerRunner is supposed to