[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966842#comment-16966842
 ] 

Adam Antal commented on YARN-9923:
----------------------------------

Thanks for looking into this [~ebadger]. I agree with you: the NM health 
check script would be a good solution for this.

The disk health checker is already a special-purpose health checker, so I 
think the Docker check could be implemented the same way.
We can have the following configuration options:
{noformat}
yarn.nodemanager.docker-health-checker.enable
yarn.nodemanager.docker-health-checker.interval-ms 
{noformat}
The enable config would be false by default (corresponding to the NONE mode); 
only when set to true would the NM regularly check the Docker daemon 
(covering the STARTUP and RUNTIME modes). However, I see no special use case 
for the STARTUP mode, so I believe it's fine to implement only the RUNTIME 
option: if the Docker daemon goes offline, the node becomes unhealthy. 
The RM already handles unhealthy nodes properly, so there is no need to shut 
the NM down immediately.
The interval-ms config would work just like the one for the regular node 
health script.
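
To make the RUNTIME behaviour concrete, here is a rough sketch of what such a periodic check might look like. It is only an illustration under assumptions: the class name DockerHealthCheckSketch and its methods are hypothetical and not part of YARN, the 10 second timeout is arbitrary, and it assumes that "docker info" exits with a non-zero code when the daemon is unreachable. A real implementation would plug into the NM's existing health checker service and report through the node health status instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; this class does not exist in YARN.
final class DockerHealthCheckSketch {
  private final String dockerBinary; // e.g. docker.binary from container-executor.cfg
  private final long intervalMs;     // value of the proposed interval-ms option
  private volatile boolean healthy = true;
  private volatile String report = "";
  private ScheduledExecutorService scheduler;

  DockerHealthCheckSketch(String dockerBinary, long intervalMs) {
    this.dockerBinary = dockerBinary;
    this.intervalMs = intervalMs;
  }

  /** Runs a single probe and records the result. */
  boolean checkOnce() {
    // Case 1: the Docker binary is missing or not executable.
    if (!Files.isExecutable(Paths.get(dockerBinary))) {
      return record(false, "Docker binary missing or not executable: " + dockerBinary);
    }
    // Case 2: the binary exists, but the daemon may not be running.
    try {
      Process p = new ProcessBuilder(dockerBinary, "info")
          .redirectErrorStream(true)
          .redirectOutput(ProcessBuilder.Redirect.DISCARD) // requires Java 9+
          .start();
      if (!p.waitFor(10, TimeUnit.SECONDS)) { // arbitrary timeout (assumption)
        p.destroyForcibly();
        return record(false, "Docker daemon check timed out");
      }
      return p.exitValue() == 0
          ? record(true, "")
          : record(false, "Docker daemon is not reachable (exit code " + p.exitValue() + ")");
    } catch (IOException e) {
      return record(false, "Could not run " + dockerBinary + ": " + e.getMessage());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return record(false, "Docker daemon check interrupted");
    }
  }

  private boolean record(boolean ok, String message) {
    healthy = ok;
    report = message;
    return ok;
  }

  boolean isHealthy() { return healthy; }

  String getReport() { return report; }

  /** Starts the periodic check; would only be called when the enable option is true. */
  synchronized void start() {
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(this::checkOnce, 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  synchronized void stop() {
    if (scheduler != null) {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // One-off check against the default binary path mentioned in the issue.
    DockerHealthCheckSketch checker = new DockerHealthCheckSketch("/usr/bin/docker", 60_000L);
    System.out.println(checker.checkOnce() ? "healthy" : "unhealthy: " + checker.getReport());
  }
}
```

In this sketch the node is only marked unhealthy and a report string is kept for the RM; nothing shuts the NM down, which matches the RUNTIME-only approach described above.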

> Detect missing Docker binary or not running Docker daemon
> ---------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>
> Currently, if a NodeManager is enabled to allocate Docker containers but the 
> specified binary (docker.binary in the container-executor.cfg) is missing, the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest adding a property, say "yarn.nodemanager.runtime.linux.docker.check", 
> with the following options:
> - STARTUP: with this option, the NodeManager would not start if the Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup).
> - RUNTIME: would give a more detailed, user-friendly exception on the 
> NodeManager's side (NM logs) if the Docker binaries are missing or the daemon 
> is not working. This would also prevent further Docker container allocation 
> as long as the binaries do not exist or the Docker daemon is not running.
> - NONE (default): preserves the current behaviour, throwing an exception 
> during container allocation and carrying on with the default retry procedure.


