Susheel Gupta created YARN-11817:
------------------------------------

             Summary: Differentiate between container-executor and application exit codes to prevent false NM health issues.
                 Key: YARN-11817
                 URL: https://issues.apache.org/jira/browse/YARN-11817
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
            Reporter: Susheel Gupta

YARN treats container exit code 24 as a critical error (INVALID_CONFIG_FILE) and marks the NodeManager as unhealthy. However, some applications also use exit code 24 for their own logic, for example to signal a missing config file. Because YARN cannot distinguish between executor-level errors and application-level exit codes, it flags healthy NodeManagers as unhealthy, which affects other applications running on the same node.

{noformat}
2025-04-13 10:36:21,919 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_e51_1739441938175_0092_02_000001 and exit code: 24
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
...
2025-04-13 10:36:21,920 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
{noformat}
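
To illustrate the distinction this issue asks for, here is a minimal Java sketch of how an exit code might be classified once its origin is known. It is not the actual NodeManager code. The class `ExitCodeClassifier`, the enums `Origin` and `Action`, and the idea that the NodeManager receives a reliable origin signal are all assumptions made for illustration. Only the value 24 (INVALID_CONFIG_FILE) comes from the report above.

```java
import java.util.Set;

/** Hypothetical sketch only; not the actual YARN NodeManager implementation. */
final class ExitCodeClassifier {

    /** Assumed signal telling us which process produced the exit code. */
    enum Origin { CONTAINER_EXECUTOR, APPLICATION }

    /** What the NodeManager would do in response to the exit code. */
    enum Action { NONE, FAIL_CONTAINER_ONLY, MARK_NODE_UNHEALTHY }

    /** Exit code 24 is the only value taken from the issue description. */
    static final int INVALID_CONFIG_FILE = 24;

    /** Executor-level codes assumed to indicate a broken node (illustrative set). */
    private static final Set<Integer> UNRECOVERABLE_EXECUTOR_CODES =
            Set.of(INVALID_CONFIG_FILE);

    private ExitCodeClassifier() {
    }

    /**
     * Marks the node unhealthy only when the code was produced by the
     * container-executor itself. The same numeric code coming from the
     * application fails just that container.
     */
    static Action classify(int exitCode, Origin origin) {
        if (exitCode == 0) {
            return Action.NONE;
        }
        if (origin == Origin.CONTAINER_EXECUTOR
                && UNRECOVERABLE_EXECUTOR_CODES.contains(exitCode)) {
            return Action.MARK_NODE_UNHEALTHY;
        }
        return Action.FAIL_CONTAINER_ONLY;
    }
}
```

Under these assumptions, `classify(24, Origin.APPLICATION)` would fail only the container, while `classify(24, Origin.CONTAINER_EXECUTOR)` would still mark the node unhealthy. How the origin would actually be determined is the open question of this issue and is not addressed by the sketch.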