[
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974676#comment-16974676
]
Eric Yang commented on YARN-9923:
---------------------------------
[~adam.antal]
{quote}do you mean that there was no public or hadoop-public API for
health-checking on purpose?{quote}
I don't know if it was intentionally omitted, but I admit I didn't spend much
time thinking about this API. A pluggable health check interface is good. There
is no doubt that a health check interface is a good feature that lets other
people implement their own health checks. I only disagree that using Java to
check Docker is a good pattern, because Java code lacks the permissions to
access privileged operations.
{quote}One improvement I can think of is to enable setting these things on a
per-script basis (allowing multiple scripts to run in parallel).{quote}
Personally, I would prefer to avoid the multi-script approach. Apache Commons
Logging is one real lesson I learned from Hadoop: too many runaway threads make
logging expensive, and it becomes hard to debug where the failure is. We have
moved to slf4j to reduce some of that bloat. A single script that runs in under
30 seconds at a 15-minute interval is preferred by most system administrators.
We don't want to burn too many CPU cycles on health check scripts. The script
itself can be organized into functions to keep things tidy, and some of those
functions could potentially move to Hadoop libexec scripts to keep the parts
hackable and tidy.
{quote}For the sake of completeness a use case: In a cluster where Dockerized
nodes with GPU are running TF jobs and nodes may depend on the availability of
the Docker daemon as well as the GPU device, as of now we can only be sure that
the node is working fine, if a container allocation is started on that node.
{quote}
If the config toggle via environment variable can work, the node manager can
decide which health check functions to run based on its own config. This can
prevent containers from being scheduled on an unhealthy node in the above use
case. I think the outcome could be a better overall solution. Wouldn't you
agree?
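
To make the idea concrete, here is a minimal Java sketch of the node manager
side, under stated assumptions. The class HealthScriptEnv, its methods, and the
variable names HEALTH_CHECK_DOCKER and HEALTH_CHECK_GPU are hypothetical and do
not exist in Hadoop today. The node manager would translate its own feature
flags into environment variables, and the single health check script would read
them to decide which of its functions to run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps node manager feature flags to script toggles. */
final class HealthScriptEnv {
  // Illustrative variable names only; these are not existing Hadoop settings.
  static final String CHECK_DOCKER = "HEALTH_CHECK_DOCKER";
  static final String CHECK_GPU = "HEALTH_CHECK_GPU";

  private HealthScriptEnv() {
  }

  /** Builds the toggles the health check script would read. */
  static Map<String, String> build(boolean dockerEnabled, boolean gpuEnabled) {
    Map<String, String> env = new LinkedHashMap<>();
    env.put(CHECK_DOCKER, Boolean.toString(dockerEnabled));
    env.put(CHECK_GPU, Boolean.toString(gpuEnabled));
    return env;
  }

  /** Prepares (but does not start) the script process with the toggles set. */
  static ProcessBuilder prepare(String scriptPath, Map<String, String> toggles) {
    ProcessBuilder pb = new ProcessBuilder(scriptPath);
    pb.environment().putAll(toggles);
    // Merge stderr into stdout so the caller reads a single stream.
    pb.redirectErrorStream(true);
    return pb;
  }
}
```

The script would then guard each function on its variable, so a node without
GPU or Docker support skips those checks and spends no cycles on them.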
> Introduce HealthReporter interface and implement running Docker daemon checker
> ------------------------------------------------------------------------------
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, yarn
> Affects Versions: 3.2.1
> Reporter: Adam Antal
> Assignee: Adam Antal
> Priority: Major
> Attachments: YARN-9923.001.patch, YARN-9923.002.patch,
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently, if a NodeManager is enabled to allocate Docker containers but the
> specified binary (docker.binary in the container-executor.cfg) is missing, the
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error
> message='Unknown error'
> {noformat}
> I suggest adding a property, say "yarn.nodemanager.runtime.linux.docker.check",
> with the following options:
> - STARTUP: with this option, the NodeManager would not start if the Docker
> binaries are missing or the Docker daemon is not running (the exception is
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception on the
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is
> not working. This would also prevent further Docker container allocation as
> long as the binaries do not exist and the Docker daemon is not running.
> - NONE (default): preserves the current behaviour, throwing an exception
> during container allocation and carrying on with the default retry procedure.
> ------------------------------------------------------------------------------------------------
> A new interface called {{HealthChecker}} is introduced which is used in the
> {{NodeHealthCheckerService}}. Currently existing implementations like
> {{LocalDirsHandlerService}} are modified to implement this giving a clear
> abstraction to the node's health. The {{DockerHealthChecker}} implements this
> new interface.
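
For illustration, the abstraction and the three proposed options in the issue
description could be sketched roughly as follows. This is not the code from the
attached patches: the method names isHealthy and getHealthReport, the enum
DockerCheckMode, and the class DockerBinaryHealthChecker are assumptions made
for this sketch. It only tests that the Docker binary is executable; checking
that the daemon is running is left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Assumed shape of the interface; the method names are illustrative. */
interface HealthChecker {
  boolean isHealthy();

  String getHealthReport();
}

/** The three values proposed for the check property in the description. */
enum DockerCheckMode {
  STARTUP, RUNTIME, NONE;

  /** Parses a config value; missing or blank input falls back to NONE. */
  static DockerCheckMode fromConfig(String value) {
    if (value == null || value.trim().isEmpty()) {
      return NONE;
    }
    // Throws IllegalArgumentException for values outside the three options.
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

/** Hypothetical checker that only verifies the Docker binary is executable. */
final class DockerBinaryHealthChecker implements HealthChecker {
  private final Path dockerBinary;

  DockerBinaryHealthChecker(String dockerBinaryPath) {
    this.dockerBinary = Paths.get(dockerBinaryPath);
  }

  @Override
  public boolean isHealthy() {
    return Files.isExecutable(dockerBinary);
  }

  @Override
  public String getHealthReport() {
    return isHealthy()
        ? ""
        : "Docker binary not found or not executable: " + dockerBinary;
  }
}
```

With STARTUP, a caller could treat an unhealthy report as fatal during
startup. With RUNTIME, it could log the report and stop allocating Docker
containers until the checker reports healthy again. With NONE, the checker
would simply not be registered.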