[ https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974676#comment-16974676 ]
Eric Yang commented on YARN-9923:
---------------------------------

[~adam.antal]

{quote}do you mean that there was no public or hadoop-public API for health-checking on purpose?{quote}

I don't know if it was intentionally omitted, but I admit I didn't spend much time thinking about this API. A pluggable health check interface is good. There is no doubt that it is a good feature that lets other people plug in their own health check implementations. I only disagree that using Java to check Docker is a good pattern, because of the missing permissions to access privileged operations.

{quote}One improvement I can think of is to enable to set these things on a per script basis (allowing multiple scripts to run parallel).{quote}

Personally, I would prefer to avoid the multi-script approach. Apache Commons Logging is one of the real lessons I learned from Hadoop: having too many runaway threads makes logging expensive and makes it hard to debug where the failure is. We have moved to slf4j to reduce some of that bloat. A single script that runs in under 30 seconds at a 15-minute interval is what most system administrators prefer. We don't want to burn too many CPU cycles on health check scripts. The script itself can be organized into functions to keep things tidy, and some of those functions could potentially move to the Hadoop libexec scripts to keep the parts hackable and tidy.

{quote}For the sake of completeness a use case: In a cluster where Dockerized nodes with GPU are running TF jobs and nodes may depend on the availability of the Docker daemon as well as the GPU device, as of now we can only be sure that the node is working fine, if a container allocation is started on that node.{quote}

If the config toggle via environment variables can work, the node manager can decide which of the health check functions to run based on its own config. In the use case above, this can prevent containers from being scheduled on an unhealthy node.
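To make the environment variable idea concrete, here is a minimal sketch of how the toggles could be handed to a single health check script. It is only an illustration under stated assumptions: the class name, the variable names {{HEALTH_CHECK_DOCKER}} and {{HEALTH_CHECK_GPU}}, and the script path are hypothetical and are not existing NodeManager code or configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not existing Hadoop code.
final class HealthScriptEnv {

    // Maps the node manager's own settings to toggles for the script.
    // The variable names are assumptions made for this example.
    static Map<String, String> toggles(boolean dockerEnabled, boolean gpuEnabled) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("HEALTH_CHECK_DOCKER", Boolean.toString(dockerEnabled));
        env.put("HEALTH_CHECK_GPU", Boolean.toString(gpuEnabled));
        return env;
    }

    // Prepares (but does not start) the single health check script
    // with the toggles added to its environment.
    static ProcessBuilder prepare(String scriptPath, Map<String, String> toggles) {
        ProcessBuilder pb = new ProcessBuilder(scriptPath);
        pb.environment().putAll(toggles);
        pb.redirectErrorStream(true);
        return pb;
    }

    public static void main(String[] args) {
        // Placeholder path; a real deployment would take it from configuration.
        ProcessBuilder pb = prepare("/path/to/health_check.sh", toggles(true, false));
        System.out.println(pb.environment().get("HEALTH_CHECK_DOCKER"));
        System.out.println(pb.environment().get("HEALTH_CHECK_GPU"));
    }
}
```

The script would then run only the functions whose toggle is set to {{true}}, so the privileged Docker and GPU checks stay in the script while the node manager only decides which of them apply.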
I think the outcome could be a better overall solution. Wouldn't you agree?

> Introduce HealthReporter interface and implement running Docker daemon checker
> ------------------------------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9923.001.patch, YARN-9923.002.patch,
>                      YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the
> specified binary (docker.binary in the container-executor.cfg) is missing the
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check"
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker
> binaries are missing or the Docker daemon is not running (the exception is
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is
> not working. This would also prevent further Docker container allocation as
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during
> container allocation, carrying on using the default retry procedure.
> ------------------------------------------------------------------------------------------------
> A new interface called {{HealthChecker}} is introduced which is used in the
> {{NodeHealthCheckerService}}. Currently existing implementations like
> {{LocalDirsHandlerService}} are modified to implement this giving a clear
> abstraction to the node's health. The {{DockerHealthChecker}} implements this
> new interface.
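
For readers following the issue description, the three options proposed for "yarn.nodemanager.runtime.linux.docker.check" could be modeled roughly as below. This is only a sketch of the described behaviour, not code from the attached patches: the class, enum, and method names are hypothetical, and treating an unset property as NONE follows the "NONE (default)" wording of the description.

```java
import java.util.Locale;

// Hypothetical sketch of the proposed options, not code from the patches.
final class DockerCheckPolicy {

    // The three values proposed for the property in the issue description.
    enum Mode { STARTUP, RUNTIME, NONE }

    // What the NodeManager would do when Docker is missing or not running.
    enum Action { FAIL_STARTUP, BLOCK_DOCKER_ALLOCATION, KEEP_CURRENT_BEHAVIOUR }

    // Parses the property value; an unset or blank value means NONE.
    // An unknown value makes Mode.valueOf throw IllegalArgumentException.
    static Mode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Mode.NONE;
        }
        return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Maps each mode to the reaction described for it in the issue.
    static Action reactionFor(Mode mode) {
        switch (mode) {
            case STARTUP:
                return Action.FAIL_STARTUP;            // treated as FATAL during startup
            case RUNTIME:
                return Action.BLOCK_DOCKER_ALLOCATION; // detailed NM log message, no new Docker containers
            default:
                return Action.KEEP_CURRENT_BEHAVIOUR;  // exception at container allocation, default retry
        }
    }

    public static void main(String[] args) {
        System.out.println(reactionFor(parse("runtime")));
        System.out.println(reactionFor(parse(null)));
    }
}
```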