[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974423#comment-16974423
 ] 

Adam Antal commented on YARN-9923:
----------------------------------

Hi [~eyang],

Thanks for sharing these concerns. I see the points you raised; let me react to 
them.
{quote}it would be better to be written as a health checker script hence, 
proper privileges can be obtained to check if the process is alive.
{quote}
Yeah, we can move the logic to a script.
{quote}Worries about future maintenance of DockerHealthCheckerService. [...]
{quote}
So do you mean that there was, on purpose, no public (or Hadoop-public) API for 
health checking? That was the missing piece the {{HealthChecker}} interface was 
trying to fill, but I can see the reason for that from this perspective.
 I consider using scripts a bit less flexible, which was my main reason for 
starting this work. Suppose you have complex health-checker logic implemented 
across multiple scripts. Then you must either maintain one giant script that 
contains all of it and cannot be supported in the long run, or one script that 
imports all the others but still does not give you the right flexibility.

One improvement I can think of is to make these settings configurable on a 
per-script basis (allowing multiple scripts to run in parallel). Similarly to 
the log aggregation controller, we could have configs like this:
{noformat}
yarn.health-checker.scripts=script1,script2,script3
yarn.health-checker-script.script1.path=/path/to/script/one
yarn.health-checker-script.script1.opts=...
yarn.health-checker-script.script1.timeout-ms=1000
yarn.health-checker-script.script2.path=/path/to/script/two
...
yarn.health-checker-script.script3.path=/path/to/script/three
{noformat}
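To illustrate how the NodeManager could resolve the proposed keys above, here is a minimal sketch in plain Java. The key names are only the proposal from this comment, and the {{HealthScriptConfig}}/{{HealthScript}} classes are hypothetical helpers (using {{java.util.Properties}} instead of Hadoop's {{Configuration}} to keep the sketch self-contained):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Sketch only: resolves the proposed per-script keys. The key names and this
// holder class are illustrative, not an existing YARN API.
public class HealthScriptConfig {
    static final String PREFIX = "yarn.health-checker-script.";

    public static class HealthScript {
        public final String name;
        public final String path;
        public final String opts;
        public final long timeoutMs;
        HealthScript(String name, String path, String opts, long timeoutMs) {
            this.name = name;
            this.path = path;
            this.opts = opts;
            this.timeoutMs = timeoutMs;
        }
    }

    // Reads yarn.health-checker.scripts, then the sub-keys of each listed script.
    public static List<HealthScript> parse(Properties conf) {
        List<HealthScript> result = new ArrayList<>();
        String scripts = conf.getProperty("yarn.health-checker.scripts", "");
        for (String name : scripts.split(",")) {
            name = name.trim();
            if (name.isEmpty()) {
                continue;
            }
            String path = conf.getProperty(PREFIX + name + ".path");
            String opts = conf.getProperty(PREFIX + name + ".opts", "");
            // Assumed default timeout of 2000 ms when the key is absent.
            long timeoutMs = Long.parseLong(
                conf.getProperty(PREFIX + name + ".timeout-ms", "2000"));
            result.add(new HealthScript(name, path, opts, timeoutMs));
        }
        return result;
    }
}
```

In the real implementation this would of course go through {{Configuration}}, but the lookup pattern (a list key plus per-name sub-keys) would be the same.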
{quote}Future code may have too many timer threads that runs at different 
intervals and use more system resources than necessary.
{quote}
If you believe the above approach makes sense, then we can limit this with a 
config - e.g. a max-timer-threads option capping the number of these threads 
(logging a warning if too many health checker scripts have been configured).
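As a sketch of the capping idea (the max-timer-threads name is hypothetical): one shared, bounded {{ScheduledExecutorService}} could serve all health check scripts instead of one timer thread per script, so the thread count never exceeds the configured maximum:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: a single bounded scheduler shared by all health check tasks.
// The max-timer-threads cap is the proposed config, not an existing one.
public class BoundedHealthScheduler {
    private final ScheduledExecutorService pool;

    public BoundedHealthScheduler(int maxTimerThreads, int numScripts) {
        // Size the pool by the smaller of the cap and the script count.
        int threads = Math.min(maxTimerThreads, Math.max(1, numScripts));
        if (numScripts > maxTimerThreads) {
            // The warning proposed in the comment above.
            System.err.println("WARN: " + numScripts
                + " health checker scripts configured but only "
                + maxTimerThreads + " timer threads allowed");
        }
        this.pool = Executors.newScheduledThreadPool(threads);
    }

    public void scheduleAll(List<Runnable> checks, long intervalMs) {
        for (Runnable check : checks) {
            pool.scheduleWithFixedDelay(check, 0, intervalMs,
                TimeUnit.MILLISECONDS);
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```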

bq. Could we change DockerHealthCheckerService into node manager healthcheck 
script? The toggle for enable docker check can be based on environment 
variables that pass down to node manager healthcheck script. This approach will 
set good example of how to pass config toggle to healthcheck script, while 
maintaining backward compatibility to user's healthcheck script.

I like this idea, will do that in the following patch.
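For the env-variable toggle, the NM side could look roughly like this sketch; the variable name {{YARN_DOCKER_HEALTH_CHECK_ENABLED}} is purely hypothetical, chosen only to illustrate passing a config toggle down to the health check script:

```java
import java.util.Map;

// Sketch only: launching the NM health check script with a config toggle
// exposed to the script as an environment variable (hypothetical name).
public class HealthScriptLauncher {
    public static ProcessBuilder build(String scriptPath,
                                       boolean dockerCheckEnabled) {
        ProcessBuilder pb = new ProcessBuilder(scriptPath);
        Map<String, String> env = pb.environment();
        env.put("YARN_DOCKER_HEALTH_CHECK_ENABLED",
                dockerCheckEnabled ? "true" : "false");
        // Merge stderr into stdout so the NM reads one stream from the script.
        pb.redirectErrorStream(true);
        return pb;
    }
}
```

The script itself would then branch on the variable and skip the Docker check when it is "false", which keeps backward compatibility for users with existing health check scripts.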

What do you think about this? Please share your thoughts.

For the sake of completeness, a use case: in a cluster where Dockerized nodes 
with GPUs run TF jobs, the nodes may depend on the availability of both the 
Docker daemon and the GPU device. As of now, we can only be sure that a node is 
working fine once a container allocation is started on it.

> Introduce HealthReporter interface and implement running Docker daemon checker
> ------------------------------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest adding a property, say "yarn.nodemanager.runtime.linux.docker.check", 
> with the following options:
> - STARTUP: with this option set, the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception on the 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist or the Docker daemon is not running.
> - NONE (default): preserves the current behaviour, throwing an exception during 
> container allocation and carrying on with the default retry procedure.
> ------------------------------------------------------------------------------------------------
> A new interface called {{HealthChecker}} is introduced and used in the 
> {{NodeHealthCheckerService}}. Existing implementations like 
> {{LocalDirsHandlerService}} are modified to implement it, giving a clear 
> abstraction of the node's health. The {{DockerHealthChecker}} implements this 
> new interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
