[ 
https://issues.apache.org/jira/browse/YARN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated YARN-2832:
-----------------------------
    Description: 
NodeManager allows users to specify the health checker script that will be 
invoked by the health-checker service via the configuration parameter, 
"_yarn.nodemanager.health-checker.script.path_" 

During the _serviceInit()_ of the health-check service, NM checks whether the 
parameter is set correctly using _shouldRun()_, as follows,

{code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
  protected void serviceInit(Configuration conf) throws Exception {
    if (NodeHealthScriptRunner.shouldRun(conf)) {
      nodeHealthScriptRunner = new NodeHealthScriptRunner();
      addService(nodeHealthScriptRunner);
    }
    addService(dirsHandler);
    super.serviceInit(conf);
  }
{code}

The problem is that if the parameter is misconfigured (e.g., permission 
problem, wrong path), NM does not have any log message to inform users which 
could cause latent errors or mysterious problems (e.g., "why my scripts does 
not work?")

I see the checking and printing logic is put in _serviceStart()_ function in 
_NodeHealthScriptRunner.java_ (see the following code snippets). However, the 
logic is very wrong. For an incorrect parameter that does not pass the 
"shouldRun" check, _serviceStart()_ would never be called because the 
_NodeHealthScriptRunner_ instance does not have the chance to be created (see 
the code snippets above).

{code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
  protected void serviceStart() throws Exception {
    // if health script path is not configured don't start the thread.
    if (!shouldRun(conf)) {
      LOG.info("Not starting node health monitor");
      return;
    }
    ... 
  }  
{code}

Basically, I think the checking and printing logic should be put in the 
serviceInit() in NodeHealthCheckerService instead of serviceStart() in 
NodeHealthScriptRunner.

See the attachment for the simple patch.




  was:
NodeManager allows users to specify the health checker script that will be 
invoked by the health-checker service via the configuration parameter, 
"_yarn.nodemanager.health-checker.script.path_" 

During the _serviceInit()_ of the health-check service, NM checks whether the 
parameter is set correctly using _shouldRun()_, as follows,

{code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
  protected void serviceInit(Configuration conf) throws Exception {
    if (NodeHealthScriptRunner.shouldRun(conf)) {
      nodeHealthScriptRunner = new NodeHealthScriptRunner();
      addService(nodeHealthScriptRunner);
    }
    addService(dirsHandler);
    super.serviceInit(conf);
  }
{code}

The problem is that if the parameter is misconfigured (e.g., permission 
problem, wrong path), NM does not have any log message to inform users which 
could cause latent errors or mysterious problems (e.g., "why my scripts does 
not work?")

I see the checking and printing logic is put in _serviceStart()_ function in 
_NodeHealthScriptRunner.java_. However, the logic is very wrong. For an 
incorrect parameter that does not pass the "shouldRun" check, _serviceStart()_ 
would never be called because the _NodeHealthScriptRunner_ instance does not 
have the chance to be created (see the code snippets above).

{code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
  protected void serviceStart() throws Exception {
    // if health script path is not configured don't start the thread.
    if (!shouldRun(conf)) {
      LOG.info("Not starting node health monitor");
      return;
    }
    ... 
  }  
{code}

Basically, I think the checking and printing logic should be put in the 
serviceInit() in NodeHealthCheckerService instead of serviceStart() in 
NodeHealthScriptRunner.

See the attachment for the simple patch.





> Wrong Logic of Checking Causes Latent Errors
> --------------------------------------------
>
>                 Key: YARN-2832
>                 URL: https://issues.apache.org/jira/browse/YARN-2832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.1, 2.5.1
>         Environment: Any environment
>            Reporter: Tianyin Xu
>         Attachments: health.check.service.1.patch
>
>
> NodeManager allows users to specify the health checker script that will be 
> invoked by the health-checker service via the configuration parameter, 
> "_yarn.nodemanager.health-checker.script.path_" 
> During the _serviceInit()_ of the health-check service, NM checks whether the 
> parameter is set correctly using _shouldRun()_, as follows,
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
>   protected void serviceInit(Configuration conf) throws Exception {
>     if (NodeHealthScriptRunner.shouldRun(conf)) {
>       nodeHealthScriptRunner = new NodeHealthScriptRunner();
>       addService(nodeHealthScriptRunner);
>     }
>     addService(dirsHandler);
>     super.serviceInit(conf);
>   }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission 
> problem, wrong path), NM does not have any log message to inform users which 
> could cause latent errors or mysterious problems (e.g., "why my scripts does 
> not work?")
> I see the checking and printing logic is put in _serviceStart()_ function in 
> _NodeHealthScriptRunner.java_ (see the following code snippets). However, the 
> logic is very wrong. For an incorrect parameter that does not pass the 
> "shouldRun" check, _serviceStart()_ would never be called because the 
> _NodeHealthScriptRunner_ instance does not have the chance to be created (see 
> the code snippets above).
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
>   protected void serviceStart() throws Exception {
>     // if health script path is not configured don't start the thread.
>     if (!shouldRun(conf)) {
>       LOG.info("Not starting node health monitor");
>       return;
>     }
>     ... 
>   }  
> {code}
> Basically, I think the checking and printing logic should be put in the 
> serviceInit() in NodeHealthCheckerService instead of serviceStart() in 
> NodeHealthScriptRunner.
> See the attachment for the simple patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to