[
https://issues.apache.org/jira/browse/YARN-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jim Brennan resolved YARN-10477.
--------------------------------
Resolution: Invalid
Closing this as invalid. The problem was only there in our internal version of
container-executor. I should have checked the code in trunk before filing.
> runc launch failure should not cause nodemanager to go unhealthy
> ----------------------------------------------------------------
>
> Key: YARN-10477
> URL: https://issues.apache.org/jira/browse/YARN-10477
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.3.1, 3.4.1
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
>
> We have observed some failures when launching containers with runc. We have
> not yet identified the root cause of those failures, but a side effect of
> these failures was that the NodeManager marked itself unhealthy. Since these
> are rare failures that only affect a single launch, they should not cause the
> NodeManager to be marked unhealthy.
> Here is an example RM log:
> {noformat}
> resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event
> dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with
> details: Linux Container Executor reached unrecoverable exception
> {noformat}
> And here is an example of the NM log:
> {noformat}
> 2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO
> runtime.RuncContainerRuntime: Launch container failed for
> container_e25_1601602719874_10691_01_001723
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
> ExitCodeException exitCode=24: OCI command has bad/missing local directories
> {noformat}
> The problem is that the runc code in container-executor is re-using exit code
> 24 (INVALID_CONFIG_FILE) which is intended for problems with the
> container-executor.cfg file, and those failures are fatal for the NM. We
> should use a different exit code for these.
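
The exit-code collision described above can be illustrated with a minimal Java sketch. This is not the NodeManager or container-executor code. The class name, the method name, and the second exit code (both its name and its value, 250) are hypothetical and exist only for this illustration. The only detail taken from the report is that exit code 24 means INVALID_CONFIG_FILE and is treated as fatal for the NM.

```java
import java.util.Set;

// Illustrative sketch only; not the actual Hadoop YARN code.
final class ExitCodeSketch {
    // From the issue text: 24 is INVALID_CONFIG_FILE, intended for problems
    // with container-executor.cfg, and is fatal for the NM.
    static final int INVALID_CONFIG_FILE = 24;

    // Hypothetical: name and value invented for this sketch to stand in for
    // a dedicated "bad/missing local directories" launch failure.
    static final int HYPOTHETICAL_RUNC_BAD_LOCAL_DIRS = 250;

    // Exit codes that should mark the whole node unhealthy.
    private static final Set<Integer> FATAL_FOR_NM = Set.of(INVALID_CONFIG_FILE);

    // Returns true when a launch failure should take the node out of service,
    // false when only the single container launch should fail.
    static boolean isFatalForNodeManager(int exitCode) {
        return FATAL_FOR_NM.contains(exitCode);
    }

    public static void main(String[] args) {
        // A launch failure that reuses 24 is indistinguishable from a broken
        // container-executor.cfg, so the node would be marked unhealthy.
        System.out.println(isFatalForNodeManager(INVALID_CONFIG_FILE));
        // A dedicated code lets the failure stay local to one launch.
        System.out.println(isFatalForNodeManager(HYPOTHETICAL_RUNC_BAD_LOCAL_DIRS));
    }
}
```

With a separate code for per-launch failures, the caller can fail the one container and leave the node's health status alone, which is the behavior the report asks for.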