[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974676#comment-16974676
 ] 

Eric Yang commented on YARN-9923:
---------------------------------

[~adam.antal] 

{quote}do you mean that there was no public or hadoop-public API for 
health-checking on purpose?{quote}

I don't know if it was intentionally omitted, but I admit I didn't spend much 
time thinking about this API.  A pluggable health check interface is good; there 
is no doubt it is a useful feature for other people to plug in their own health 
check implementations.  I only disagree that using Java to check Docker is a 
good pattern, because the Java process lacks the permissions to access 
privileged operations.
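
To make the permission point concrete, here is a minimal sketch of the check 
done from a script instead.  The binary path, the use of {{docker info}}, and 
the ERROR-line convention are my own illustration, not something in the patch:

{noformat}
#!/usr/bin/env bash
# Sketch only: check the Docker daemon from the health script instead of from
# Java.  The binary path and the use of "docker info" are assumptions for
# illustration.
DOCKER_BIN="${DOCKER_BIN:-/usr/bin/docker}"

if [ ! -x "${DOCKER_BIN}" ]; then
  # a line starting with ERROR marks the node unhealthy to the NodeManager
  echo "ERROR docker binary ${DOCKER_BIN} is missing or not executable"
  exit 1
fi

# "docker info" talks to the daemon socket, which normally needs root or
# docker group membership -- permissions the script can have but the
# NodeManager JVM usually does not.
if ! "${DOCKER_BIN}" info > /dev/null 2>&1; then
  echo "ERROR docker daemon is not responding"
  exit 1
fi
{noformat}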

{quote}One improvement I can think of is to enable setting these things on a per 
script basis (allowing multiple scripts to run in parallel).{quote}

Personally, I would prefer to avoid the multi-script approach.  Apache Commons 
Logging is one real lesson I learned from Hadoop: having too many runaway 
threads made logging expensive and made it hard to debug where the failure was.  
We have moved to slf4j to reduce some of that bloat.  A single script that runs 
in under 30 seconds on a 15-minute interval is what most system administrators 
prefer; we don't want to burn too many CPU cycles on health check scripts.  The 
script itself can be organized into functions to keep things tidy, and some of 
those functions could potentially move into the Hadoop libexec scripts to keep 
the parts hackable.
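
As a rough sketch of what I mean by one script organized into functions (the 
function names and bodies are placeholders I made up, not the real checks):

{noformat}
#!/usr/bin/env bash
# Sketch only: one health script, one function per check.  Each function
# returns non-zero on failure and the loop prints an ERROR line, which the
# NodeManager health checker treats as unhealthy.

check_local_dirs() {
  # placeholder: fail if the local dirs volume is more than 90% full
  local used
  used=$(df --output=pcent /var/lib/hadoop-yarn 2>/dev/null | tail -1 | tr -dc '0-9')
  [ -n "${used}" ] && [ "${used}" -lt 90 ]
}

check_docker_daemon() {
  /usr/bin/docker info > /dev/null 2>&1
}

for check in check_local_dirs check_docker_daemon; do
  if ! "${check}"; then
    echo "ERROR ${check} failed"
    exit 1
  fi
done
{noformat}

The shared functions are the parts that could eventually live under libexec.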

{quote}For the sake of completeness, a use case: in a cluster where Dockerized 
nodes with GPUs are running TF jobs, the nodes may depend on the availability of 
the Docker daemon as well as the GPU device; as of now we can only be sure that 
the node is working fine if a container allocation is started on that node.
{quote}

If the config toggle via environment variables can work, the node manager can 
decide which of the health check functions to run based on its own config.  For 
the use case above, this can prevent containers from being scheduled on an 
unhealthy node.  I think the outcome could be a better overall solution.  
Wouldn't you agree?
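
Roughly what I have in mind with the toggles (the variable and function names 
are made up; the node manager would export the variables from its own config 
before invoking the script):

{noformat}
#!/usr/bin/env bash
# Sketch only: the NodeManager exports toggles derived from its own config,
# so the script runs only the checks that apply to this node.

check_docker_daemon() { /usr/bin/docker info > /dev/null 2>&1; }
check_gpu()           { nvidia-smi > /dev/null 2>&1; }

run_if_enabled() {
  local toggle="$1" check="$2"
  if [ "${toggle}" = "true" ] && ! "${check}"; then
    echo "ERROR ${check} failed"
    exit 1
  fi
}

run_if_enabled "${NM_HEALTH_CHECK_DOCKER:-false}" check_docker_daemon
run_if_enabled "${NM_HEALTH_CHECK_GPU:-false}"    check_gpu
{noformat}

A GPU node would export both toggles; a plain node would leave them off and 
never pay for the docker or nvidia-smi calls.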

> Introduce HealthReporter interface and implement running Docker daemon checker
> ------------------------------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing, the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest adding a property, say "yarn.nodemanager.runtime.linux.docker.check", 
> with the following options:
> - STARTUP: with this option set, the NodeManager would not start if the Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception on the 
> NodeManager's side (NM logs) if the Docker binaries are missing or the daemon 
> is not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist or the Docker daemon is not running.
> - NONE (default): preserves the current behaviour, throwing the exception 
> during container allocation and carrying on with the default retry procedure.
> ------------------------------------------------------------------------------------------------
> A new interface called {{HealthChecker}} is introduced, which is used in the 
> {{NodeHealthCheckerService}}. Existing implementations like 
> {{LocalDirsHandlerService}} are modified to implement it, giving a clear 
> abstraction of the node's health. The {{DockerHealthChecker}} also implements 
> this new interface.


