Susheel Gupta created YARN-11817:
------------------------------------

             Summary: Differentiate between container-executor and application 
exit codes to prevent false NM health issues.
                 Key: YARN-11817
                 URL: https://issues.apache.org/jira/browse/YARN-11817
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
            Reporter: Susheel Gupta


YARN treats container exit code 24 as a critical container-executor error 
(INVALID_CONFIG_FILE) and marks the NodeManager as unhealthy. However, some 
applications use exit code 24 for their own purposes, such as signaling a 
missing config file. Because YARN can’t distinguish executor-level errors from 
application-level exit codes, it flags healthy NodeManagers as unhealthy, which 
affects other applications running on the same node.

An example from the NodeManager log:


{noformat}
2025-04-13 10:36:21,919 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_e51_1739441938175_0092_02_000001 and exit code: 24
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
...
2025-04-13 10:36:21,920 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
{noformat}
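
To illustrate the distinction this issue asks for, here is a minimal sketch 
rather than a proposed patch. Every name in it (ExitCodeClassifier, Origin, 
Action, classify) is hypothetical and does not correspond to existing YARN 
classes. It assumes the NodeManager could somehow learn whether an exit code 
came from container-executor itself or from the application process; how that 
origin would be determined is not specified in this issue.

```java
import java.util.Set;

/** Hypothetical sketch; none of these names are actual YARN classes. */
public class ExitCodeClassifier {

    /** Where the exit code was produced (assumed to be knowable). */
    enum Origin { CONTAINER_EXECUTOR, APPLICATION }

    /** What the NodeManager would do in response. */
    enum Action { MARK_NODE_UNHEALTHY, FAIL_CONTAINER_ONLY }

    // Exit code 24 is INVALID_CONFIG_FILE, as stated in the issue description.
    static final int INVALID_CONFIG_FILE = 24;

    // Assumption: only this code is treated as node-fatal in the sketch.
    private static final Set<Integer> EXECUTOR_FATAL_CODES =
            Set.of(INVALID_CONFIG_FILE);

    /**
     * Only an executor-level fatal code affects node health; the same
     * numeric code coming from an application fails just that container.
     */
    static Action classify(Origin origin, int exitCode) {
        if (origin == Origin.CONTAINER_EXECUTOR
                && EXECUTOR_FATAL_CODES.contains(exitCode)) {
            return Action.MARK_NODE_UNHEALTHY;
        }
        return Action.FAIL_CONTAINER_ONLY;
    }

    public static void main(String[] args) {
        // Same exit code, different origin, different outcome.
        System.out.println(classify(Origin.CONTAINER_EXECUTOR, 24));
        System.out.println(classify(Origin.APPLICATION, 24));
    }
}
```

With this kind of split, an application that exits with 24 would fail only 
its own container, while a real container-executor configuration error would 
still mark the node unhealthy.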



