[jira] [Comment Edited] (YARN-9923) Detect missing Docker binary or not running Docker daemon

2019-11-13 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973197#comment-16973197
 ] 

Adam Antal edited comment on YARN-9923 at 11/13/19 9:58 AM:


Made a mistake in the findbugs-exclude.xml, fixed that and the valid checkstyle 
errors in  [^YARN-9923.004.patch].


was (Author: adam.antal):
Made a mistake in the findbugs-exclude.xml and also fixed the valid checkstyle 
errors in  [^YARN-9923.004.patch].

> Detect missing Docker binary or not running Docker daemon
> -
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: : No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9923) Detect missing Docker binary or not running Docker daemon

2019-11-09 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970862#comment-16970862
 ] 

Szilard Nemeth edited comment on YARN-9923 at 11/9/19 4:42 PM:
---

Hi [~adam.antal]!

Really like this rework of the NM health checks and the way you did implement 
the common interface for all healthchecks. This was a refactor could have been 
done long time ago. Kudos for it :) 

Some comments: 
1. As the change is bigger, this patch is not only deals with Docker health 
checks but has further refactors. 
Please include your design decisions in the jira description: What is the 
purpose of the common interface, what refactors did you done, how the 
implementors of the interface behave, etc.
Without these written in the description, it was pretty difficult to review the 
patch.

2. Nit: In NodeHealthScriptRunner: I know you moved this part of the code, but 
let's fix this javadoc: 
  {code}
  /** Time after which the script should be timedout. */
  private long scriptTimeout;
  {code}

  "timedout" should be "timed out".

3. Nit: In NodeHealthScriptRunner.newInstance: I think you could modify this 
log statement a bit:
  {code}
 LOG.info("Node Manager health check script is not available "
  + "or doesn't have execute permission, so not "
  + "starting the node health script runner.");
  {code}
  I think you could change the string "node health script runner" to the class 
name instead: NodeHealthScriptRunner. Maybe this way, the log message is more 
explicit and to the point.

4. In NodeHealthScriptRunner.reportHealthStatus: I think this info log should 
be warn instead:
 {code}
   default:
LOG.info("Unknown HealthCheckerExitStatus - ignored.");
break;
 {code}

In the contstuctor of NodeHealthScriptRunner, you have this statement: 
{code}
super(NodeHealthScriptRunner.class.getName(), chkInterval);
{code}
Typo in name "chkInterval".

5. Nit: Can you move the constructor of NodeHealthScriptRunner above or below 
the newInstane method? It was pretty hard to find it down there.

6. Nit: Javadoc of NodeHealthScriptRunner#shouldRun could be improved: 
{code}
 Method used to determine if or not node health monitoring service should be
   * started or not. Returns true if following conditions are met:
{code}
I would modify the first line as: "Method used to determine whether the health 
monitoring service should be started or not".

7. If NodeHealthScriptRunner#shouldRun returns false, can you log the same 
message but on warn or error level? I think this is an error case, as the 
script is null or empty. Isn't it? 
Maybe you could also log the script's file name on debug level, but maybe not 
in this method.

8. Can you use an enum to mark successful / unsuccessful cases 
NodeHealthMonitorExecutor#reportHealthStatus calls setHealthStatus? I don't 
think the boolean flag is the cleanest approach. Moreover, you have the log 
statement: 
{code}
LOG.info("health status being set as " + output);
{code}, that looks weird with a true/false value.

9. Nit: In NodeHealthMonitorExecutor#reportHealthStatus: The branches SUCCESS 
and FAILED_WITH_EXIT_CODE can be merged together as they are running the same 
code.

10. Nit: Javadoc of NodeHealthMonitorExecutor#hasErrors looks weird for 
parameter 'output'

11. In the constructor of NodeHealthScriptRunner: I would simply do: 
{code}
this.task = new NodeHealthMonitorExecutor(scriptArgs);
{code}, making the code more readable and making setTimerTask only used by 
tests.

12. Nit: Access on method TimedHealthReporterService#setLastReportedTime can be 
private.

13. Is it intentional that TimedHealthReporterService#setHealthStatus(boolean, 
java.lang.String) is not calling this.setLastReportedTime(); with the current 
time? If it is, why?

14. Nit: In the javadoc of class NodeHealthCheckerService: "The class..." 
should be "This class".

15. In the javadoc of class NodeHealthCheckerService: Can you add some basic 
examples on how to use this class, how reporters should be added, etc?

16. In NodeHealthCheckerService#addHealthReporter, if the 1st if-condition 
(noneMatch) fails, can you log something? AFAIU, if this condition is false, 
someone tried to add a duplicate service, so this is more likely a programming 
error.

17. Nit: Could you rephrase the javadoc of 
NodeHealthCheckerService#getHealthReport?

18. Question about NodeHealthCheckerService#isHealthy and its usage of 
allMatch: What if all reporters are returning false as the return value of 
isHealthy? Would allMatch return true for this case? Can you write a unit test 
for this case?

19. Nit: 
DockerHealthCheckerService.DockerDaemonMonitorExecutor#getPossiblePidFileLocations:
 Can you store "docker.pid" as a constant?

20. Comments for DockerDaemonMonitorExecutor#run: 
- Please log if none of the pid file variants are found - in other words, when 
you need 

[jira] [Comment Edited] (YARN-9923) Detect missing Docker binary or not running Docker daemon

2019-10-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956995#comment-16956995
 ] 

Peter Bacsko edited comment on YARN-9923 at 10/22/19 12:16 PM:
---

_"NONE (default): preserving the current behaviour [...]"_

Even the current behaviour can be improved. Right now there are multiple error 
messages, one after the other. If the binary is missing, there's no need to 
emit multiple lines. Simply say "Fatal error: /usr/bin/docker is missing" and 
exit immediately.


was (Author: pbacsko):
_"NONE (default): preserving the current behaviour [...]"_

Even the current behaviour can be improved. Right now there are multiple error 
messages, one after the another. If the binary is missing, there's no need to 
emit multiple lines. Simply say "Fatal error: /usr/bin/docker is missing" and 
exit immediately.

> Detect missing Docker binary or not running Docker daemon
> -
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: : No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org