Jim Brennan created YARN-10477:
----------------------------------

             Summary: runc launch failure should not cause nodemanager to go unhealthy
                 Key: YARN-10477
                 URL: https://issues.apache.org/jira/browse/YARN-10477
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 3.3.1, 3.4.1
            Reporter: Jim Brennan
            Assignee: Jim Brennan


We have observed some failures when launching containers with runc. We have
not yet identified the root cause of those failures, but a side effect was
that the NodeManager marked itself unhealthy. Since these are rare failures
that each affect only a single launch, they should not cause the NodeManager
to be marked unhealthy.

Here is an example RM log:
{noformat}
resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with details: Linux Container Executor reached unrecoverable exception
{noformat}
And here is an example of the NM log:
{noformat}
2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO runtime.RuncContainerRuntime: Launch container failed for container_e25_1601602719874_10691_01_001723
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=24: OCI command has bad/missing local directories
{noformat}

The problem is that the runc code in container-executor reuses exit code 24
(INVALID_CONFIG_FILE), which is intended for problems with the
container-executor.cfg file. The NM treats those failures as fatal. We should
use a different exit code for these runc launch failures so that only the
affected container launch fails.
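
To illustrate why the choice of exit code matters, here is a minimal sketch of
the kind of classification involved. It is not the actual NodeManager code:
the class name, the Outcome enum, and classify() are hypothetical, and the
value 200 is an arbitrary placeholder, not a proposed exit code. Only the
value 24 for INVALID_CONFIG_FILE is taken from the description above.

```java
import java.util.Set;

public class ExitCodeSketch {
    // From the issue text: 24 is INVALID_CONFIG_FILE, meant for problems
    // with container-executor.cfg.
    static final int INVALID_CONFIG_FILE = 24;

    // Hypothetical placeholder for a dedicated runc launch error code.
    // The value is arbitrary and used for illustration only.
    static final int RUNC_LAUNCH_ERROR_PLACEHOLDER = 200;

    // Assumption: exit codes in this set are treated as fatal for the node.
    static final Set<Integer> FATAL_FOR_NODE = Set.of(INVALID_CONFIG_FILE);

    enum Outcome { CONTAINER_FAILED, NODE_UNHEALTHY }

    // A fatal code marks the whole node unhealthy; any other code only
    // fails the single container launch.
    static Outcome classify(int exitCode) {
        return FATAL_FOR_NODE.contains(exitCode)
            ? Outcome.NODE_UNHEALTHY
            : Outcome.CONTAINER_FAILED;
    }

    public static void main(String[] args) {
        // Reusing 24 for a runc launch problem takes the node out of service.
        System.out.println(classify(INVALID_CONFIG_FILE));
        // A distinct code keeps the failure scoped to one container.
        System.out.println(classify(RUNC_LAUNCH_ERROR_PLACEHOLDER));
    }
}
```

Under these assumptions, a runc failure reported with code 24 cannot be told
apart from a broken container-executor.cfg, while a separate code would let
the NM fail only the affected launch.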



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

