Jim Brennan created YARN-10477:
----------------------------------
Summary: runc launch failure should not cause nodemanager to go
unhealthy
Key: YARN-10477
URL: https://issues.apache.org/jira/browse/YARN-10477
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 3.3.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan
We have observed some failures when launching containers with runc. We have
not yet identified the root cause of those failures, but a side effect was
that the NodeManager marked itself unhealthy. Since these are rare failures
that only affect a single launch, they should not cause the NodeManager to be
marked unhealthy.
Here is an example RM log:
{noformat}
resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event
dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with
details: Linux Container Executor reached unrecoverable exception
{noformat}
And here is an example of the NM log:
{noformat}
2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO
runtime.RuncContainerRuntime: Launch container failed for
container_e25_1601602719874_10691_01_001723
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
ExitCodeException exitCode=24: OCI command has bad/missing local
directories
{noformat}
The problem is that the runc code in container-executor is re-using exit code
24 (INVALID_CONFIG_FILE), which is intended for problems with the
container-executor.cfg file. Those failures are fatal for the NM. We should
use a different exit code for these runc launch failures.
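
To illustrate why sharing the exit code matters, here is a minimal sketch of a caller that classifies exit codes as node-fatal or not. It is not the actual NodeManager or container-executor code. The class name, the method name, the contents of the fatal set, and the dedicated runc exit code (both its name and its value) are assumptions made up for this example. Only the value 24 for INVALID_CONFIG_FILE comes from the description above.

```java
import java.util.Set;

// Illustrative sketch only; not the actual NodeManager implementation.
final class ExitCodePolicy {
    // From this issue: 24 is INVALID_CONFIG_FILE, meant for
    // container-executor.cfg problems, which are fatal for the NM.
    static final int INVALID_CONFIG_FILE = 24;

    // Hypothetical: a dedicated exit code for runc launch problems.
    // Both the name and the value are assumptions for this example.
    static final int HYPOTHETICAL_RUNC_LAUNCH_ERROR = 200;

    // Exit codes that should mark the whole node unhealthy
    // (assumed set for this sketch).
    private static final Set<Integer> NODE_FATAL = Set.of(INVALID_CONFIG_FILE);

    // Returns true if a launch failure with this exit code should
    // take the node out of service rather than fail one container.
    static boolean marksNodeUnhealthy(int exitCode) {
        return NODE_FATAL.contains(exitCode);
    }

    public static void main(String[] args) {
        System.out.println(marksNodeUnhealthy(INVALID_CONFIG_FILE));
        System.out.println(marksNodeUnhealthy(HYPOTHETICAL_RUNC_LAUNCH_ERROR));
    }
}
```

With a check like this, a runc failure that exits with 24 cannot be told apart from a broken container-executor.cfg, so the node is marked unhealthy. A separate exit code lets the same check fail only the affected container launch.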