Jim Brennan created YARN-10477:
----------------------------------

Summary: runc launch failure should not cause nodemanager to go unhealthy
Key: YARN-10477
URL: https://issues.apache.org/jira/browse/YARN-10477
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 3.3.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan

We have observed some failures when launching containers with runc. We have not yet identified the root cause of those failures, but a side effect was that the NodeManager marked itself unhealthy. Since these are rare failures that each affect only a single launch, they should not cause the NodeManager to be marked unhealthy.

Here is an example RM log:
{noformat}
resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with details: Linux Container Executor reached unrecoverable exception
{noformat}

And here is an example of the NM log:
{noformat}
2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO runtime.RuncContainerRuntime: Launch container failed for container_e25_1601602719874_10691_01_001723
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=24: OCI command has bad/missing local directories
{noformat}

The problem is that the runc code in container-executor re-uses exit code 24 (INVALID_CONFIG_FILE), which is intended for problems with the container-executor.cfg file. Failures with that exit code are fatal for the NM. We should use a different exit code for the runc launch failures.
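
To make the exit-code distinction concrete, here is a minimal Java sketch of how a caller might separate node-fatal exit codes from per-launch failures. It is not the actual Hadoop code: the class `ExitCodeSketch`, the method `isFatalForNodeManager`, and the constant `HYPOTHETICAL_RUNC_LAUNCH_FAILED` with its value 100 are all assumptions made up for illustration. Only the pairing of 24 with INVALID_CONFIG_FILE comes from the description above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; these are not the real Hadoop classes. */
class ExitCodeSketch {

    /** A tiny subset of exit codes that container-executor could return. */
    enum ExitCode {
        // From the issue: intended for problems with container-executor.cfg.
        INVALID_CONFIG_FILE(24),
        // Hypothetical: a dedicated code for a failed runc launch. Both the
        // name and the value 100 are placeholders, not what the fix uses.
        HYPOTHETICAL_RUNC_LAUNCH_FAILED(100);

        final int value;

        ExitCode(int value) {
            this.value = value;
        }

        /** Returns the matching constant, or null for an unknown code. */
        static ExitCode fromInt(int value) {
            for (ExitCode code : values()) {
                if (code.value == value) {
                    return code;
                }
            }
            return null;
        }
    }

    /** Codes that should take the whole node out of service. */
    private static final Set<ExitCode> FATAL =
        EnumSet.of(ExitCode.INVALID_CONFIG_FILE);

    /** In this sketch, unknown codes are treated as per-launch failures. */
    static boolean isFatalForNodeManager(int exitCode) {
        ExitCode code = ExitCode.fromInt(exitCode);
        return code != null && FATAL.contains(code);
    }

    public static void main(String[] args) {
        System.out.println(isFatalForNodeManager(24));   // prints "true"
        System.out.println(isFatalForNodeManager(100));  // prints "false"
    }
}
```

With a separate code, a bad runc launch would fail only the container being launched, while a broken container-executor.cfg would still mark the node unhealthy.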