[
https://issues.apache.org/jira/browse/YARN-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated YARN-11959:
----------------------------------
Labels: pull-request-available (was: )
> NodeManager becomes unhealthy when container exits with code 22 or 24
> ---------------------------------------------------------------------
>
> Key: YARN-11959
> URL: https://issues.apache.org/jira/browse/YARN-11959
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: KWON BYUNGCHANG
> Priority: Major
> Labels: pull-request-available
> Attachments: YARN-11959.001.patch
>
>
> When a user container exits with code 22 or 24, the NodeManager becomes
> unhealthy and no more containers are allocated to that node. This situation
> can be resolved by restarting the NodeManager.
>
>
> It can be reproduced immediately by running a Scala Spark wordcount job that
> exits with code 22.
>
>
> I propose fixing this by remapping a container exit code of 22 or 24 to a
> different exit code, so that the ConfigurationException that causes the
> NodeManager to become unhealthy is not triggered.
>
> {noformat}
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Docker inspect command: /usr/bin/docker inspect --format {{.State.ExitCode}} container_e161_1711009858797_8304894_01_000015
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
> 2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(340)) - Failed to launch container due to configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
>     ... 8 more
> {noformat}
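
The remapping idea proposed in the description can be illustrated with a small, self-contained sketch. This is not the attached patch: the class name ExitCodeRemapper, the method remap, and the replacement value 255 are all hypothetical, chosen only to show the shape of the idea. It assumes, as the report states, that a user container exiting with 22 or 24 is what leads to the ConfigurationException.

```java
public class ExitCodeRemapper {
    // Hypothetical replacement value; the real patch may pick a different code.
    public static final int REMAPPED_EXIT_CODE = 255;

    // Returns the exit code to report for a user container.
    // Codes 22 and 24 (the two named in the report) are swapped for the
    // replacement value; every other code is passed through unchanged.
    public static int remap(int userExitCode) {
        if (userExitCode == 22 || userExitCode == 24) {
            return REMAPPED_EXIT_CODE;
        }
        return userExitCode;
    }

    public static void main(String[] args) {
        // Print how a few sample exit codes would be reported.
        for (int code : new int[] {0, 1, 22, 24, 137}) {
            System.out.println(code + " -> " + remap(code));
        }
    }
}
```

With a mapping like this applied to the user process's exit code, the failure would still be reported as a non-zero exit, but it would no longer collide with the two codes that the description says make the node unhealthy.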