Devaraj K commented on YARN-2832:

Nice catch [~tianyin]. Thanks for your contribution.

The patch looks good to me except these comments.

- can you change the log level to INFO and log message similar to the one in 
LOG.info("Not starting node health monitor");

- and also can you remove the shouldRun() redundant check in 

> Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors
> ------------------------------------------------------------------
>                 Key: YARN-2832
>                 URL: https://issues.apache.org/jira/browse/YARN-2832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.1, 2.5.1
>         Environment: Any environment
>            Reporter: Tianyin Xu
>         Attachments: health.check.service.1.patch
> NodeManager allows users to specify the health checker script that will be 
> invoked by the health-checker service via the configuration parameter, 
> "_yarn.nodemanager.health-checker.script.path_" 
> During the _serviceInit()_ of the health-check service, NM checks whether the 
> parameter is set correctly using _shouldRun()_, as follows,
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
>   protected void serviceInit(Configuration conf) throws Exception {
>     if (NodeHealthScriptRunner.shouldRun(conf)) {
>       nodeHealthScriptRunner = new NodeHealthScriptRunner();
>       addService(nodeHealthScriptRunner);
>     }
>     addService(dirsHandler);
>     super.serviceInit(conf);
>   }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission 
> problem, wrong path), NM does not have any log message to inform users which 
> could cause latent errors or mysterious problems (e.g., "why my scripts does 
> not work?")
> I see the checking and printing logic is put in _serviceStart()_ function in 
> _NodeHealthScriptRunner.java_ (see the following code snippets). However, the 
> logic is very wrong. For an incorrect parameter that does not pass the 
> "shouldRun" check, _serviceStart()_ would never be called because the 
> _NodeHealthScriptRunner_ instance does not have the chance to be created (see 
> the code snippets above).
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
>   protected void serviceStart() throws Exception {
>     // if health script path is not configured don't start the thread.
>     if (!shouldRun(conf)) {
>       LOG.info("Not starting node health monitor");
>       return;
>     }
>     ... 
>   }  
> {code}
> Basically, I think the checking and printing logic should be put in the 
> serviceInit() in NodeHealthCheckerService instead of serviceStart() in 
> NodeHealthScriptRunner.
> See the attachment for the simple patch.

This message was sent by Atlassian JIRA

Reply via email to