KWON BYUNGCHANG created YARN-11959:
--------------------------------------
Summary: NodeManager becomes unhealthy when container exits with
code 22 or 24
Key: YARN-11959
URL: https://issues.apache.org/jira/browse/YARN-11959
Project: Hadoop YARN
Issue Type: Bug
Reporter: KWON BYUNGCHANG
When a user container exits with code 22 or 24, the NodeManager becomes
unhealthy and no further containers are allocated to that node. The node
recovers only after the NodeManager is restarted.

The problem can be reproduced immediately by running a Scala Spark wordcount
job that exits with code 22.

I propose fixing this by wrapping exit codes 22 and 24 in a different exit
code, so that the ConfigurationException that makes the NodeManager unhealthy
is not triggered.
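
A minimal sketch of the proposed wrapping, for illustration only. The class name, method name, and offset value below are hypothetical and are not part of Hadoop; the actual patch may choose a different replacement code and a different place to apply it. The only facts taken from this report are that user exit codes 22 and 24 lead to the ConfigurationException shown in the log below.

```java
// Hypothetical sketch: none of these names exist in Hadoop.
final class ExitCodeRemapSketch {

    // Assumed offset used to move a colliding user exit code out of the way.
    // The value is arbitrary; it only needs to keep the result in 0..255
    // and away from the codes the NodeManager treats as unrecoverable.
    static final int REMAP_OFFSET = 100;

    // Returns the exit code to report for a user container.
    // Codes 22 and 24 are the ones this report says make the node unhealthy,
    // so they are wrapped; every other code is passed through unchanged.
    static int remapUserExitCode(int userExitCode) {
        if (userExitCode == 22 || userExitCode == 24) {
            return userExitCode + REMAP_OFFSET;
        }
        return userExitCode;
    }

    public static void main(String[] args) {
        for (int code : new int[] {0, 1, 22, 24}) {
            System.out.println(code + " -> " + remapUserExitCode(code));
        }
    }
}
```

With a mapping like this, a container that exits with 22 would be reported with a code that still signals failure to the application but does not match the values that trigger the ConfigurationException.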
{noformat}
2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Docker inspect command: /usr/bin/docker inspect --format {{.State.ExitCode}} container_e161_1711009858797_8304894_01_000015
2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(340)) - Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
    ... 8 more
{noformat}