[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972480#comment-16972480
 ] 

Adam Antal commented on YARN-9923:
----------------------------------

Hi [~snemeth]!

Thanks for the thorough review. I have found your points reasonable, let me 
reflect those.

Resolved the following points: 2, 3, 4, 5, 6, 9, 10, 12, 14, 17, 19. Please 
verify whether I have understood your comments and fixed those items properly.

Some extra things: 
1. Will do this after this comment.
7. Partly do that, please check the function 
{{NodeHealthScriptRunner#newInstance}}.
8. I refactored a bit more {{TimedHealthReporterService#setHealthStatus}}, now 
every call sets the {{lastReportedTime}} attribute as well. Also I don't want 
to change the basic idea behind "being health or not", so I think the boolean 
flag is just enough for stating the main information - is the node healthy. If 
not, every other detail should go into the health report which can a be as 
verbose as the writer of the services wants it to be.
11. Can not be resolved since managing {{this.task}} is the responsibility of 
the super class ({{TimedHealthReporterService}}), and I think it should not set 
directly to there.
13. I am not sure about that, but now it does. See 8. for further detail.
15. Please check whether it makes sense. Also a new test case is added to 
demonstrate this - please take a look at 
({{TestNodeHealthCheckerService#testCustomHealthReporter}}).
16. No, it's not a programmatic error. This was a solution to mock a specific 
service for testing (we first add the mocked service of dirsHandler for 
example, and later when we want to do this again inside the logic of the class, 
the mocked one will be kept). Added debug level logging there.
18. It's an allMatch looking for all {{HealthReporter}} s to return true, if at 
least one of it returns false, it will fail. It will fail in case every service 
returns false. I don't see why we should add test for this - that would test 
the allMatch function and not the behaviour of the class.
20. The whole function got a bit simplified there - using less low-level calls. 
Could you please take a look at that?
21. Yes, the tests were not finished, I finished it in patch v3. Could you 
check?

Thanks!

> Detect missing Docker binary or not running Docker daemon
> ---------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9923.001.patch, YARN-9923.002.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to