[ 
https://issues.apache.org/jira/browse/YARN-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11959:
----------------------------------
    Labels: pull-request-available  (was: )

> NodeManager becomes unhealthy when container exits with code 22 or 24
> ---------------------------------------------------------------------
>
>                 Key: YARN-11959
>                 URL: https://issues.apache.org/jira/browse/YARN-11959
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: KWON BYUNGCHANG
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-11959.001.patch
>
>
> When a user container exits with code 22 or 24, the NodeManager becomes 
> unhealthy and no more containers are allocated to that node. This situation 
> can be resolved by restarting the NodeManager.
>  
>  
> The problem can be reproduced immediately by running a Scala Spark wordcount 
> job that exits with code 22.
>  
>  
> I propose to fix this by remapping exit code 22 or 24 from a user container 
> to a different exit code, so that the ConfigurationException that causes the 
> NodeManager to become unhealthy is not triggered.
>  
> {noformat}
> 2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
> 2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(532)) - Docker inspect command: 
> /usr/bin/docker inspect --format {{.State.ExitCode}} 
> container_e161_1711009858797_8304894_01_000015
> 2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
> 2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to 
> /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
> 2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch 
> (ContainerLaunch.java:call(340)) - Failed to launch container due to 
> configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container 
> Executor reached unrecoverable exception
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Launch container failed
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
>         ... 8 more {noformat}
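
A minimal sketch of the remapping proposed in the description, assuming the user container's exit code can be rewritten before the NodeManager interprets it. The class name, the method name, and the replacement values 122 and 124 are hypothetical; they are not taken from YARN-11959.001.patch.

```java
public class ExitCodeRemapSketch {

    // Hypothetical replacement values, chosen only for illustration.
    static final int REMAPPED_22 = 122;
    static final int REMAPPED_24 = 124;

    /**
     * Returns the exit code to report for a user container.
     * Codes 22 and 24 are the ones this issue reports as triggering
     * ConfigurationException, so they are replaced; all others pass through.
     */
    public static int remapUserExitCode(int userExitCode) {
        if (userExitCode == 22) {
            return REMAPPED_22;
        }
        if (userExitCode == 24) {
            return REMAPPED_24;
        }
        return userExitCode;
    }

    public static void main(String[] args) {
        // Print the mapping for a few sample exit codes.
        int[] samples = {0, 1, 22, 24, 143};
        for (int code : samples) {
            System.out.println(code + " -> " + remapUserExitCode(code));
        }
    }
}
```

With a mapping like this, a user process that exits with 22 or 24 would still be reported as failed, but with a code that the executor's error handling does not treat as an unrecoverable configuration problem.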



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

