KWON BYUNGCHANG created YARN-11959:
--------------------------------------

             Summary: NodeManager becomes unhealthy when container exits with code 22 or 24
                 Key: YARN-11959
                 URL: https://issues.apache.org/jira/browse/YARN-11959
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: KWON BYUNGCHANG


When a user container exits with code 22 or 24, the NodeManager becomes 
unhealthy and no more containers are allocated to that node. This situation can 
be resolved by restarting the NodeManager.

 
 
It can be reproduced immediately by running a Scala Spark wordcount job that 
exits with code 22.
 
 
I propose fixing this by remapping a container exit code of 22 or 24 to a 
different exit code, so that the ConfigurationException that causes the 
NodeManager to become unhealthy is not triggered.
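
The proposed remapping could look roughly like the sketch below. This is only an illustration of the idea, not the actual patch: the class name ExitCodeWrapper, the method wrapUserExitCode, and the offset of 100 are all hypothetical, and where such a step would be applied in the NodeManager launch path is left open. The only facts taken from this report are that exit codes 22 and 24 trigger the problem.

```java
final class ExitCodeWrapper {
    // Hypothetical offset, chosen arbitrarily for this sketch. The results
    // (122 and 124) stay within the 0-255 range of process exit codes.
    static final int WRAP_OFFSET = 100;

    // Returns true for the two exit codes that, per this report, lead to the
    // ConfigurationException when a user container returns them.
    static boolean isReservedCode(int exitCode) {
        return exitCode == 22 || exitCode == 24;
    }

    // Remaps a reserved user exit code to a different value and passes every
    // other exit code through unchanged.
    static int wrapUserExitCode(int exitCode) {
        return isReservedCode(exitCode) ? exitCode + WRAP_OFFSET : exitCode;
    }

    public static void main(String[] args) {
        // Print the mapping for a few sample exit codes.
        for (int code : new int[] {0, 1, 22, 24}) {
            System.out.println(code + " -> " + wrapUserExitCode(code));
        }
    }
}
```

Using an offset rather than a single replacement value keeps 22 and 24 distinguishable after remapping, which may help when diagnosing the user job. Any chosen target value would need to be checked against the codes the container executor already uses.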
 
{noformat}
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Docker inspect command: /usr/bin/docker inspect --format {{.State.ExitCode}} container_e161_1711009858797_8304894_01_000015
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(340)) - Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
        ... 8 more
{noformat}



