[ 
https://issues.apache.org/jira/browse/YARN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K resolved YARN-2832.
-----------------------------
    Resolution: Duplicate

It is fixed as part of YARN-3375, closing as duplicate.

> Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors
> ------------------------------------------------------------------
>
>                 Key: YARN-2832
>                 URL: https://issues.apache.org/jira/browse/YARN-2832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.1, 2.5.1
>         Environment: Any environment
>            Reporter: Tianyin Xu
>         Attachments: health.check.service.1.patch
>
>
> NodeManager allows users to specify the health checker script that will be 
> invoked by the health-checker service via the configuration parameter, 
> "_yarn.nodemanager.health-checker.script.path_" 
> During the _serviceInit()_ of the health-check service, NM checks whether the 
> parameter is set correctly using _shouldRun()_, as follows,
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
>   protected void serviceInit(Configuration conf) throws Exception {
>     if (NodeHealthScriptRunner.shouldRun(conf)) {
>       nodeHealthScriptRunner = new NodeHealthScriptRunner();
>       addService(nodeHealthScriptRunner);
>     }
>     addService(dirsHandler);
>     super.serviceInit(conf);
>   }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission 
> problem, wrong path), NM does not have any log message to inform users which 
> could cause latent errors or mysterious problems (e.g., "why my scripts does 
> not work?")
> I see the checking and printing logic is put in _serviceStart()_ function in 
> _NodeHealthScriptRunner.java_ (see the following code snippets). However, the 
> logic is very wrong. For an incorrect parameter that does not pass the 
> "shouldRun" check, _serviceStart()_ would never be called because the 
> _NodeHealthScriptRunner_ instance does not have the chance to be created (see 
> the code snippets above).
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
>   protected void serviceStart() throws Exception {
>     // if health script path is not configured don't start the thread.
>     if (!shouldRun(conf)) {
>       LOG.info("Not starting node health monitor");
>       return;
>     }
>     ... 
>   }  
> {code}
> Basically, I think the checking and printing logic should be put in the 
> serviceInit() in NodeHealthCheckerService instead of serviceStart() in 
> NodeHealthScriptRunner.
> See the attachment for the simple patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to