[
https://issues.apache.org/jira/browse/YARN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tianyin Xu updated YARN-2832:
-----------------------------
Summary: Wrong Check Logic of Causes Latent Errors (was: Wrong Logic of
Checking Causes Latent Errors)
> Wrong Check Logic of Causes Latent Errors
> ------------------------------------------
>
> Key: YARN-2832
> URL: https://issues.apache.org/jira/browse/YARN-2832
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.4.1, 2.5.1
> Environment: Any environment
> Reporter: Tianyin Xu
> Attachments: health.check.service.1.patch
>
>
> NodeManager allows users to specify the health checker script that will be
> invoked by the health-checker service via the configuration parameter,
> "_yarn.nodemanager.health-checker.script.path_"
> During the _serviceInit()_ of the health-check service, NM checks whether the
> parameter is set correctly using _shouldRun()_, as follows,
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
> protected void serviceInit(Configuration conf) throws Exception {
> if (NodeHealthScriptRunner.shouldRun(conf)) {
> nodeHealthScriptRunner = new NodeHealthScriptRunner();
> addService(nodeHealthScriptRunner);
> }
> addService(dirsHandler);
> super.serviceInit(conf);
> }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission
> problem, wrong path), NM does not have any log message to inform users which
> could cause latent errors or mysterious problems (e.g., "why my scripts does
> not work?")
> I see the checking and printing logic is put in _serviceStart()_ function in
> _NodeHealthScriptRunner.java_ (see the following code snippets). However, the
> logic is very wrong. For an incorrect parameter that does not pass the
> "shouldRun" check, _serviceStart()_ would never be called because the
> _NodeHealthScriptRunner_ instance does not have the chance to be created (see
> the code snippets above).
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
> protected void serviceStart() throws Exception {
> // if health script path is not configured don't start the thread.
> if (!shouldRun(conf)) {
> LOG.info("Not starting node health monitor");
> return;
> }
> ...
> }
> {code}
> Basically, I think the checking and printing logic should be put in the
> serviceInit() in NodeHealthCheckerService instead of serviceStart() in
> NodeHealthScriptRunner.
> See the attachment for the simple patch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)