[
https://issues.apache.org/jira/browse/YARN-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079607#comment-18079607
]
ASF GitHub Bot commented on YARN-11959:
---------------------------------------
Copilot commented on code in PR #8474:
URL: https://github.com/apache/hadoop/pull/8474#discussion_r3212206555
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:
##########
@@ -653,6 +653,16 @@ static int validate_run_as_user_commands(int argc, char **argv, int *operation)
}
}
+static int wrap_exit_code(int exit_code) {
+ if (exit_code == INVALID_CONTAINER_EXEC_PERMISSIONS || exit_code == INVALID_CONFIG_FILE) {
+ int wrap_code = WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED;
+ fprintf(LOGFILE, "Wrapped exit code of user container from %d to %d to avoid NodeManager unhealthy...\n", exit_code, wrap_code);
Review Comment:
The log message "to avoid NodeManager unhealthy..." is ungrammatical and
doesn’t say which condition is being avoided. Consider rewording it to
something like "to avoid NodeManager becoming unhealthy", and optionally
mention that the wrap avoids the ConfigurationException path in
LinuxContainerExecutor for exit codes 22/24.
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/util.h:
##########
@@ -109,7 +109,8 @@ enum errorcodes {
TOO_LONG_EXECUTOR_PATH = 81,
CANNOT_GET_EXECUTABLE_NAME_FROM_KERNEL = 82,
CANNOT_GET_EXECUTABLE_NAME_FROM_PID = 83,
- WRONG_PATH_OF_EXECUTABLE = 84
+ WRONG_PATH_OF_EXECUTABLE = 84,
+ WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED = 85
};
Review Comment:
New exit code 85 is added to util.h, but get_error_message() in util.c
doesn’t have a case for it, so container-executor will log "Unknown error code"
for this wrapped value. Also, util.h error codes appear to be mirrored by
LinuxContainerExecutor.ExitCode in Java (up to 84), so consider updating those
mappings to keep enums/messages consistent for diagnostics.
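
A minimal sketch of the kind of message mapping this comment asks for. It is standalone on purpose: `errorcodes_sketch`, `sketch_error_message()` and the message strings are hypothetical stand-ins, not the actual contents of util.c, and only the two enum values quoted in the diff above are reproduced.

```c
/* Stand-in for the tail of the enum in util.h (values taken from the diff). */
enum errorcodes_sketch {
  WRONG_PATH_OF_EXECUTABLE = 84,
  WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED = 85
};

/*
 * Hypothetical helper illustrating a message lookup that covers the new
 * code. The message texts are placeholders, not the strings in util.c.
 */
static const char *sketch_error_message(int error_code) {
  switch (error_code) {
    case WRONG_PATH_OF_EXECUTABLE:
      return "Wrong path of executable";
    case WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED:
      return "User container failed; original exit code was wrapped";
    default:
      /* Without an explicit case, code 85 would end up here. */
      return "Unknown error code";
  }
}
```

If the Java side mirrors these values, it would presumably need a matching entry so both sides report the same diagnostic.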
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:
##########
@@ -653,6 +653,16 @@ static int validate_run_as_user_commands(int argc, char **argv, int *operation)
}
}
+static int wrap_exit_code(int exit_code) {
+ if (exit_code == INVALID_CONTAINER_EXEC_PERMISSIONS || exit_code == INVALID_CONFIG_FILE) {
+ int wrap_code = WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED;
+ fprintf(LOGFILE, "Wrapped exit code of user container from %d to %d to avoid NodeManager unhealthy...\n", exit_code, wrap_code);
+ return wrap_code;
+ } else {
+ return exit_code;
+ }
Review Comment:
wrap_exit_code() changes user container exit codes 22 and 24 to a single
wrapped value (85). This masks the original container exit code in YARN
diagnostics and makes 22 vs 24 indistinguishable; consider preserving the
original information (e.g., distinct wrapped codes per original value, or an
offset scheme) and/or surfacing the original exit code in diagnostics so users
can still debug failures accurately.
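
One way the "distinct wrapped codes per original value" idea could look, as a sketch only. The names `WRAPPED_EXIT_CODE_22`, `WRAPPED_EXIT_CODE_24`, `wrap_exit_code_sketch()` and `unwrap_exit_code_sketch()` and the value 86 are assumptions made for illustration; they are not part of the PR. The log stream is passed in as a parameter so the sketch does not depend on the executor's LOGFILE.

```c
#include <stdio.h>

/* Values 22 and 24 come from the review comment; 85/86 are hypothetical. */
enum {
  INVALID_CONTAINER_EXEC_PERMISSIONS = 22,
  INVALID_CONFIG_FILE = 24,
  WRAPPED_EXIT_CODE_22 = 85, /* hypothetical: wraps user exit code 22 */
  WRAPPED_EXIT_CODE_24 = 86  /* hypothetical: wraps user exit code 24 */
};

/* Map each reserved code to its own wrapped value so the two stay distinguishable. */
static int wrap_exit_code_sketch(int exit_code, FILE *log) {
  int wrapped;
  if (exit_code == INVALID_CONTAINER_EXEC_PERMISSIONS) {
    wrapped = WRAPPED_EXIT_CODE_22;
  } else if (exit_code == INVALID_CONFIG_FILE) {
    wrapped = WRAPPED_EXIT_CODE_24;
  } else {
    return exit_code; /* all other codes pass through unchanged */
  }
  fprintf(log,
          "Wrapped exit code of user container from %d to %d "
          "to avoid NodeManager becoming unhealthy\n",
          exit_code, wrapped);
  return wrapped;
}

/* Inverse mapping, e.g. for surfacing the original code in diagnostics. */
static int unwrap_exit_code_sketch(int code) {
  if (code == WRAPPED_EXIT_CODE_22) return INVALID_CONTAINER_EXEC_PERMISSIONS;
  if (code == WRAPPED_EXIT_CODE_24) return INVALID_CONFIG_FILE;
  return code;
}
```

The trade-off is that a one-to-one mapping consumes one new enum value per reserved code, whereas a single wrapped value stays compact but loses the 22 vs 24 distinction.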
> NodeManager becomes unhealthy when container exits with code 22 or 24
> ---------------------------------------------------------------------
>
> Key: YARN-11959
> URL: https://issues.apache.org/jira/browse/YARN-11959
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: KWON BYUNGCHANG
> Priority: Major
> Labels: pull-request-available
> Attachments: YARN-11959.001.patch
>
>
> When a user container exits with code 22 or 24, the NodeManager becomes
> unhealthy and no more containers are allocated to that node. This situation
> can be resolved by restarting the NodeManager.
>
>
> It can be reproduced immediately by running a Scala Spark wordcount job that
> exits with code 22.
>
>
> I propose to fix this by wrapping exit codes 22 and 24 in a different exit
> code, so that the ConfigurationException that causes the NodeManager to
> become unhealthy is not triggered.
>
> {noformat}
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(532)) - Docker inspect command:
> /usr/bin/docker inspect --format {{.State.ExitCode}}
> container_e161_1711009858797_8304894_01_000015
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
> 2024-09-23 18:50:14,360 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to
> /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
> 2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch
> (ContainerLaunch.java:call(340)) - Failed to launch container due to
> configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container
> Executor reached unrecoverable exception
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
> Launch container failed
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
> ... 8 more {noformat}