[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-11-07 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645146#comment-15645146
 ] 

Allen Wittenauer commented on YARN-5635:


OK, looks like I'm opening another JIRA issue so that -alpha2 release notes are 
correct.

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Allen Wittenauer
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-16 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497488#comment-15497488
 ] 

Allen Wittenauer commented on YARN-5635:


Back to something useful...

Is this going to get worked on before -alpha2 comes out (which appears to be 
going through RC soon)?  If not, we'll need to open Yet Another JIRA reporting on 
the revert so that it shows up in the release notes.

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Allen Wittenauer
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-14 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490825#comment-15490825
 ] 

Vinod Kumar Vavilapalli commented on YARN-5635:

Sure, you can respond instead by not unilaterally reverting patches in the 
future. And everyone will be happy.

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489040#comment-15489040
 ] 

Allen Wittenauer commented on YARN-5635:


I'd respond, but I don't believe giving that type of feedback publicly is 
good etiquette.

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488977#comment-15488977
 ] 

Vinod Kumar Vavilapalli commented on YARN-5635:

I'm deliberately staying out of the technical details here though I do have 
strong opinions on this.

[~aw]

bq. YARN-5567 has already been reverted.
I saw in YARN-5567 as well that the patch was reverted even though there was no 
ack from the original contributor / committer.

Please don't do this.

Even if there are differing opinions, you should give a heads up, wait and then 
revert something. This isn't something that needs to be coded into the bylaws, 
it's basic etiquette. There is no reason for a unilateral revert without 
discussion.

bq. I'm going to -1 any patch that even thinks about treating the exit code as 
a way to mark the NM as unhealthy.
This isn't a constructive tone either; you could have simply said it isn't the 
right solution and provided alternatives. With this tone, you are essentially 
shooing away interested volunteers from the project.



> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487913#comment-15487913
 ] 

Naganarasimha G R commented on YARN-5635:

Sorry, I missed the earlier comment from Allen (MAPREDUCE-6743). 
Ignore my earlier comment.

What I meant was: if the exit code is not zero, can we capture it as a 
different health status rather than marking it as unhealthy?

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487889#comment-15487889
 ] 

Allen Wittenauer commented on YARN-5635:


bq. does that hold true for even making it an option via a configuration 
setting?

Yes.

I don't know how many ways I can tell you that depending on an error code 
here is extremely dangerous and has proven to be unreliable due to the 
constantly shifting nature of the state of the node on busy clusters. Throw in 
all of these "magically expanding/shrinking" task resource management bits that 
have gone in, and the situation gets even worse.

Besides, if you REALLY REALLY REALLY want to do this, all you need to do is 
wrap your existing health check in something else that, upon failure, prints 
the ERROR message.  
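
The wrapper idea above could look roughly like the following Python sketch. It assumes the existing check is an executable whose command line is passed in as a list; the function name `wrap_health_check` and the wording of the ERROR line are made up for illustration and are not part of Hadoop. The point is that the operator who opts in carries the risk of trusting the exit code, not the NodeManager.

```python
import subprocess
import sys


def wrap_health_check(cmd):
    """Run the existing health check and return the text the wrapper should print.

    Hypothetical helper: a non-zero exit from the wrapped check is turned into
    a line beginning with "ERROR"; otherwise the check's own stdout is passed
    through unchanged. Timeouts and a missing executable are not handled here.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Assumption: the operator has decided that any failure of this
        # particular check should be reported through the output.
        return f"ERROR wrapped health check exited with code {result.returncode}"
    return result.stdout.rstrip("\n")
```

A script configured as the node health script would print the returned text and exit 0 itself, for example `print(wrap_health_check(sys.argv[1:]))`.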

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487878#comment-15487878
 ] 

Naganarasimha G R commented on YARN-5635:

[~rchiang], unfortunately I reopened this JIRA and reworded it at almost the 
same time. Sorry, I was not aware that a new JIRA had been raised a little 
earlier than this, and thanks for closing it. You can go ahead and make it a 
subtask of YARN-5078.

Well, your new approach almost sounds like an incompatible change for 
existing node health scripts, since it defines a new exit code. But is it 
required? Existing code treats any exit code other than zero as unsuccessful 
and reports it as {{HealthCheckerExitStatus.FAILED_WITH_EXIT_CODE}}. But 
{{HealthCheckerExitStatus.FAILED}} is reported when the output of the script 
has the {{"ERROR"}} string in it.
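
A minimal sketch of the classification just described, written in Python for illustration rather than copied from the NodeManager source. Only the two FAILED names come from the text above; `SUCCESS`, the order of the two checks, and matching "ERROR" at the start of a line are assumptions.

```python
from enum import Enum


class HealthCheckerExitStatus(Enum):
    SUCCESS = "SUCCESS"  # assumed name for the clean case
    FAILED_WITH_EXIT_CODE = "FAILED_WITH_EXIT_CODE"
    FAILED = "FAILED"


def classify(exit_code: int, output: str) -> HealthCheckerExitStatus:
    """Classify one run of the health script from its exit code and stdout."""
    # Any non-zero exit code is reported as an unsuccessful run of the script.
    if exit_code != 0:
        return HealthCheckerExitStatus.FAILED_WITH_EXIT_CODE
    # Assumption: an error is signalled by a line that begins with "ERROR".
    if any(line.startswith("ERROR") for line in output.splitlines()):
        return HealthCheckerExitStatus.FAILED
    return HealthCheckerExitStatus.SUCCESS
```

With this split, a script that crashes and a script that reports a sick node end up in different buckets, which is the distinction the rest of this comment relies on.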

So what we would want to address here is how to better handle the case where 
the script output has errors or the script times out. In this case it would 
*not* be good to gracefully drain the NM directly, but rather to report that 
the status could not be obtained from the NM properly through the script. Any 
thoughts on my earlier comment?
{code}
NM can inform Healthy/UnHealthy/HealthValidationError, and this can be sent 
across in the heartbeat to the RM, and the RM can capture the state of this NM 
as something other than Running and UnHealthy (a new state). This can be 
displayed in the web UI and can also be queried using ./yarn node -list -state
{code}
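
The proposal quoted above can be sketched as a three-valued report from the NM that the RM maps to a node state. This is only one possible reading: `HEALTH_CHECK_ERROR` as the name of the new state, and treating a timeout or a non-zero exit as a validation error, are assumptions made for illustration.

```python
from enum import Enum


class NodeHealthReport(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY = "UnHealthy"
    HEALTH_VALIDATION_ERROR = "HealthValidationError"


def report_from_script(timed_out: bool, exit_code: int, output: str) -> NodeHealthReport:
    """Decide what the NM would send in its heartbeat after one script run."""
    # Assumption: if the script itself misbehaved, nothing is known about the node.
    if timed_out or exit_code != 0:
        return NodeHealthReport.HEALTH_VALIDATION_ERROR
    if any(line.startswith("ERROR") for line in output.splitlines()):
        return NodeHealthReport.UNHEALTHY
    return NodeHealthReport.HEALTHY


# RM-side view; "HEALTH_CHECK_ERROR" is a hypothetical name for the new state.
RM_STATE = {
    NodeHealthReport.HEALTHY: "RUNNING",
    NodeHealthReport.UNHEALTHY: "UNHEALTHY",
    NodeHealthReport.HEALTH_VALIDATION_ERROR: "HEALTH_CHECK_ERROR",
}


def rm_node_state(report: NodeHealthReport) -> str:
    """Map the NM's report to the state the RM would show for the node."""
    return RM_STATE[report]
```

Under this reading a node with a broken script keeps running containers but becomes visible as a separate state instead of being drained.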

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487877#comment-15487877
 ] 

Ray Chiang commented on YARN-5635:

[~aw] does that hold true for even making it an option via a configuration 
setting?

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487860#comment-15487860
 ] 

Allen Wittenauer commented on YARN-5635:


bq. except the newly defined error code which will mark the NodeManager as 
UNHEALTHY

No exceptions.  There is zero guarantee that the exit code of the script is the 
one you're actually looking to catch.  For example, MAPREDUCE-6743 fixes a bug 
with the linking of nttest.  The exit code on that prior to the fix? 127. 

Let me be absolutely crystal clear:  I'm going to -1 any patch that even thinks 
about treating the exit code as a way to mark the NM as unhealthy.
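
To illustrate why an exit code alone is ambiguous: on a POSIX system the shell returns 127 when a command cannot be found, which is the same value a script could return on purpose. The sketch below assumes `sh` is on the PATH; the command name `no_such_health_probe` is made up.

```python
import subprocess


def exit_code_of(shell_snippet: str) -> int:
    """Run a snippet with `sh -c` and return its exit code (output is captured and ignored)."""
    completed = subprocess.run(["sh", "-c", shell_snippet], capture_output=True)
    return completed.returncode


# A broken script: the probe binary does not exist, so the shell itself fails.
broken = exit_code_of("no_such_health_probe --check-disks")

# A script that deliberately signals a problem using the same number.
deliberate = exit_code_of("exit 127")

# Both runs look identical to a caller that only inspects the exit code.
print(broken, deliberate)
```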



> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487835#comment-15487835
 ] 

Ray Chiang commented on YARN-5635:

And does anyone mind if I make this a subtask of YARN-5078?

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487832#comment-15487832
 ] 

Ray Chiang commented on YARN-5635:

I had captured something in a separate JIRA, but am putting it here.  This was 
my attempt to sum up the discussion at the end of YARN-5567:

Done as an alternate design to YARN-5567. Define a specific exit code for the 
health checker script (property yarn.nodemanager.health-checker.script.path) 
that allows the node to be blacklisted.

As discussed in the latter part of YARN-5567, the current design requirements 
are:

# Ignore all exit codes from the script
## _except_ the newly defined error code which will mark the NodeManager as 
UNHEALTHY
## This allows any syntax or functional errors in the script to be ignored
# Upon failure (or multiple recorded failures):
## Store the status in the metrics2 state on the NodeManager
## Allow the RM to blacklist the NM or allow the jobs to drain, depending on 
how we want UNHEALTHY to be treated
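
The requirements listed above can be sketched as follows. This illustrates the proposal as summarized here, not merged behaviour; the value 42 for the designated exit code, the class name, and counting consecutive rather than total failures are all assumptions. Storing the status in metrics2 and the RM-side handling are left out.

```python
UNHEALTHY_EXIT_CODE = 42  # hypothetical value for the newly defined exit code


class ExitCodeHealthTracker:
    """Track health script runs, ignoring every exit code except the designated one."""

    def __init__(self, failures_needed: int = 1) -> None:
        if failures_needed < 1:
            raise ValueError("failures_needed must be at least 1")
        self.failures_needed = failures_needed
        self.consecutive_failures = 0

    def record(self, exit_code: int) -> bool:
        """Record one run; return True if the node should be reported UNHEALTHY."""
        if exit_code == UNHEALTHY_EXIT_CODE:
            self.consecutive_failures += 1
        else:
            # Syntax or functional errors in the script (any other code) are ignored.
            self.consecutive_failures = 0
        return self.consecutive_failures >= self.failures_needed
```

For example, a tracker created with `failures_needed=2` reports UNHEALTHY only after two runs in a row exit with the designated code.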


> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487639#comment-15487639
 ] 

Naganarasimha G R commented on YARN-5635:

One more comment from Allen:
{code}
It needs to be available via metrics2, otherwise it's invisible to most large 
scale ops teams.
{code}

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Fix For: 3.0.0-alpha2
>
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.






[jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript

2016-09-13 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487633#comment-15487633
 ] 

Naganarasimha G R commented on YARN-5635:

One of the approaches that was pointed out in YARN-5567:
{code}
NM can inform Healthy/UnHealthy/HealthValidationError, and this can be sent 
across in the heartbeat to the RM, and the RM can capture the state of this NM 
as something other than Running and UnHealthy (a new state). This can be 
displayed in the web UI and can also be queried using ./yarn node -list -state
{code}
Any more thoughts on this approach? cc/ [~yufeigu] & [~rchiang]

> Better handling when bad script is configured as Node's HealthScript
> 
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Fix For: 3.0.0-alpha2
>
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report when the script configured as the node health script is erroneous,
> since it might fail to detect bad health of a node.


