[ https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983403#comment-16983403 ]
Adam Antal commented on YARN-9923:
----------------------------------
Uploaded patchset v6.
- I was thinking about adding the Docker health checker script itself to the
test resources (as the main part of this issue is to support running up to 4
scripts), but I think the behaviour of checking a script is already well tested,
so adding it to the tests would not add any value.
- The responsibility of marking the node unhealthy if an exception has occurred
was refactored into a separate class. I think in this way... Added three levels
of tests for it: unit tests for {{ExceptionReporter}} and
{{NodeHealthCheckerService}}, and, as an integration test, a case at the
{{NodeManager}} level ({{NodeStatusUpdater}}).
> Introduce HealthReporter interface and implement running Docker daemon checker
> ------------------------------------------------------------------------------
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, yarn
> Affects Versions: 3.2.1
> Reporter: Adam Antal
> Assignee: Adam Antal
> Priority: Major
> Attachments: YARN-9923.001.patch, YARN-9923.002.patch,
> YARN-9923.003.patch, YARN-9923.004.patch, YARN-9923.005.patch,
> YARN-9923.006.patch
>
>
> Currently, if a NodeManager is enabled to allocate Docker containers but the
> specified binary (docker.binary in the container-executor.cfg) is missing, the
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error
> message='Unknown error'
> {noformat}
> I suggest adding a property, say "yarn.nodemanager.runtime.linux.docker.check",
> with the following options:
> - STARTUP: with this option the NodeManager would not start if the Docker
> binaries are missing or the Docker daemon is not running (the exception is
> considered FATAL during startup).
> - RUNTIME: the NodeManager would give a more detailed, user-friendly exception
> on its side (NM logs) if the Docker binaries are missing or the daemon is not
> working. This would also prevent further Docker container allocation as long
> as the binaries are missing or the Docker daemon is not running.
> - NONE (default): preserves the current behaviour: an exception is thrown
> during container allocation and the default retry procedure is followed.
> ------------------------------------------------------------------------------------------------
> A new interface called {{HealthChecker}} is introduced and used in
> {{NodeHealthCheckerService}}. Existing implementations such as
> {{LocalDirsHandlerService}} are modified to implement it, giving a clear
> abstraction of the node's health. The {{DockerHealthChecker}} implements this
> new interface.
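To make the proposed abstraction concrete, here is a minimal sketch under stated assumptions: the interface is named HealthChecker as in the description (the issue title calls it HealthReporter), its two methods are hypothetical, the DockerHealthChecker below only checks that the configured binary is an executable file (checking that the daemon is running, e.g. via "docker info", is left out), and CompositeHealthChecker is an illustrative stand-in for how {{NodeHealthCheckerService}} could combine checkers. None of this is the code from the patch.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical shape of the interface; method names are assumptions. */
interface HealthChecker {
  boolean isHealthy();

  /** Human-readable reason for an unhealthy verdict; empty when healthy. */
  String getHealthReport();
}

/**
 * Illustrative Docker checker: only verifies that the configured binary
 * (docker.binary, /usr/bin/docker by default) is an executable regular file.
 */
class DockerHealthChecker implements HealthChecker {
  private final Path dockerBinary;

  DockerHealthChecker(String dockerBinary) {
    this.dockerBinary = Paths.get(dockerBinary);
  }

  @Override
  public boolean isHealthy() {
    return Files.isRegularFile(dockerBinary) && Files.isExecutable(dockerBinary);
  }

  @Override
  public String getHealthReport() {
    return isHealthy() ? "" : "Docker binary missing or not executable: " + dockerBinary;
  }
}

/**
 * Hypothetical aggregator: the node is healthy only if every checker is,
 * and the non-empty reports are joined into a single message.
 */
class CompositeHealthChecker implements HealthChecker {
  private final List<HealthChecker> checkers;

  CompositeHealthChecker(HealthChecker... checkers) {
    this.checkers = Arrays.asList(checkers);
  }

  @Override
  public boolean isHealthy() {
    return checkers.stream().allMatch(HealthChecker::isHealthy);
  }

  @Override
  public String getHealthReport() {
    return checkers.stream()
        .map(HealthChecker::getHealthReport)
        .filter(report -> !report.isEmpty())
        .collect(Collectors.joining("; "));
  }
}
```

With a shared interface, the aggregator does not need to know whether a verdict comes from local directories, a health script or Docker; adding a new check is just another implementation passed to the composite.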
--
This message was sent by Atlassian Jira
(v8.3.4#803005)