[
https://issues.apache.org/jira/browse/YARN-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susheel Gupta reassigned YARN-11817:
------------------------------------
Assignee: Susheel Gupta
> Differentiate between container-executor and application exit codes to
> prevent false NM health issues.
> ------------------------------------------------------------------------------------------------------
>
> Key: YARN-11817
> URL: https://issues.apache.org/jira/browse/YARN-11817
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Reporter: Susheel Gupta
> Assignee: Susheel Gupta
> Priority: Major
>
> YARN treats container exit code 24 as a critical error (INVALID_CONFIG_FILE)
> and marks the NodeManager as unhealthy. However, some applications also use
> exit code 24 for their own purposes, such as signaling a missing config file.
> Because YARN cannot distinguish executor-level errors from application-level
> exit codes, it flags healthy NodeManagers as unhealthy, which affects other
> applications running on the same node.
> {noformat}
> 2025-04-13 10:36:21,919 WARN
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception
> from container-launch with container ID:
> container_e51_1739441938175_0092_02_000001 and exit code: 24
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
> Launch container failed
> ...
> 2025-04-13 10:36:21,920 ERROR
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Failed to launch container due to configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container
> Executor reached unrecoverable exception{noformat}
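
The distinction the description asks for can be sketched roughly as follows. This is only an illustration, not the proposed patch: the class ExitCodeOrigin, the enum Source, and the method shouldMarkNodeUnhealthy are hypothetical names, and the sketch assumes the NodeManager has some way to learn which layer produced the exit code, which is exactly the part the issue leaves open.

```java
public class ExitCodeOrigin {
    /** Hypothetical marker for which layer produced an exit code. */
    public enum Source { CONTAINER_EXECUTOR, APPLICATION }

    /** Exit code 24, which the issue says YARN reads as INVALID_CONFIG_FILE. */
    public static final int INVALID_CONFIG_FILE = 24;

    /**
     * Only an executor-level 24 counts as a node problem. An application
     * that happens to exit with 24 is treated as an ordinary container
     * failure. (Assumed policy for illustration.)
     */
    public static boolean shouldMarkNodeUnhealthy(int exitCode, Source source) {
        return source == Source.CONTAINER_EXECUTOR
            && exitCode == INVALID_CONFIG_FILE;
    }

    public static void main(String[] args) {
        // Same numeric code, different origin, different outcome.
        System.out.println(shouldMarkNodeUnhealthy(24, Source.CONTAINER_EXECUTOR));
        System.out.println(shouldMarkNodeUnhealthy(24, Source.APPLICATION));
    }
}
```

With a check like this, the numeric code alone no longer decides node health; the origin of the code does. How that origin would actually be conveyed from container-executor to the NodeManager is not specified in the issue.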