[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203473#comment-17203473
 ] 

Eric Payne commented on YARN-9809:
--

I have committed this to branch-3.3 and branch-3.2. It looks like there is some 
additional work necessary if we want this to be backported to 3.1. I, for one, 
don't think that is necessary, but please comment if you disagree.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.
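
The window described in the issue can be illustrated with a small model. This is only a sketch under stated assumptions: the `Node` class, the two `register*` helpers, `schedule`, and `heartbeat` are hypothetical names invented for illustration and are not the ResourceManager's actual classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

class RegistrationHealthSketch {
    enum NodeState { RUNNING, UNHEALTHY }

    /** Hypothetical RM-side view of one NodeManager. */
    static final class Node {
        NodeState state;
        final List<String> containers = new ArrayList<>();
    }

    /** Behaviour described in the issue: a newly registered node always starts RUNNING. */
    static Node registerWithoutHealth() {
        Node n = new Node();
        n.state = NodeState.RUNNING;
        return n;
    }

    /** Proposed behaviour: the NM supplies its health status when it registers. */
    static Node registerWithHealth(boolean healthy) {
        Node n = new Node();
        n.state = healthy ? NodeState.RUNNING : NodeState.UNHEALTHY;
        return n;
    }

    /** In this model the scheduler only places containers on RUNNING nodes. */
    static boolean schedule(Node n, String containerId) {
        if (n.state != NodeState.RUNNING) {
            return false;
        }
        n.containers.add(containerId);
        return true;
    }

    /** An unhealthy heartbeat marks the node UNHEALTHY and kills its containers. Returns the number killed. */
    static int heartbeat(Node n, boolean healthy) {
        if (healthy) {
            n.state = NodeState.RUNNING;
            return 0;
        }
        n.state = NodeState.UNHEALTHY;
        int killed = n.containers.size();
        n.containers.clear();
        return killed;
    }

    public static void main(String[] args) {
        Node old = registerWithoutHealth();
        schedule(old, "c1");
        schedule(old, "c2");
        System.out.println("killed at first heartbeat: " + heartbeat(old, false));

        Node fixed = registerWithHealth(false);
        System.out.println("scheduled on unhealthy node: " + schedule(fixed, "c3"));
    }
}
```

In the first case the containers are placed and then lost at the first heartbeat; in the second the unhealthy node never becomes a scheduling target.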



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203346#comment-17203346
 ] 

Eric Payne commented on YARN-9809:
--

The latest branch-3.2 precommit build looks fine. The unit test failures are 
the same ones that are failing on branch-3.2 without the patch _except_ 
{{TestRaceWhenRelogin}}, which is not failing for me in my local build with or 
without the patch.

+1. I will commit this today.




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202448#comment-17202448
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m 33s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} buf {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} buf was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 18 new or modified test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 35s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 33m 37s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 23m 16s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 38s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 29s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 29m 38s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 41s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 38s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 14m 31s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 7s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 22m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 22m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 22m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 4m 16s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/195/artifact/out/diff-checkstyle-root.txt{color} | {color:orange} root: The patch generated 3 new + 1258 unchanged - 1 fixed = 1261 total (was 1259) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 5s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green}{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 24s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 59s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 42s{color} | {color:green}{color} | 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202390#comment-17202390
 ] 

Eric Payne commented on YARN-9809:
--

Version 009 LGTM. +1




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202388#comment-17202388
 ] 

Eric Payne commented on YARN-9809:
--

Thanks a lot, [~ebadger] for the backport, and thank you [~Jim_Brennan] for the 
great reviews.

I have verified that the following unit tests are also failing in branch-3.2:
{noformat}
TestYarnConfigurationFields
TestZKConfigurationStore
TestSystemMetricsPublisherForV2
TestFSSchedulerConfigurationStore
TestCombinedSystemMetricsPublisher
{noformat}




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202352#comment-17202352
 ] 

Jim Brennan commented on YARN-9809:
---

Thanks for fixing the test [~ebadger]!
+1 for YARN-9809-branch-3.2.009.patch





[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202314#comment-17202314
 ] 

Eric Badger commented on YARN-9809:
---

So close. Those pesky unit tests. Patch 009 fixes the unit test failure. Thanks 
for the review, [~Jim_Brennan]!




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202289#comment-17202289
 ] 

Jim Brennan commented on YARN-9809:
---

Thanks [~ebadger] for the updated patch!  I am +1 on 
YARN-9809-branch-3.2.008.patch except for this one failing unit test:

TestYarnConfigurationFields
{noformat}
2020-09-25 11:46:46,358 ERROR conf.TestConfigurationFieldsBase 
(TestConfigurationFieldsBase.java:testCompareConfigurationClassAgainstXml(485)) 
- class org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing 
in yarn-default.xml
2020-09-25 11:46:46,359 INFO  conf.TestConfigurationFieldsBase 
(TestConfigurationFieldsBase.java:lambda$appendMissingEntries$1(507)) -   
yarn.nodemanager.health-checker.run-before-startup
 {noformat}
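
The log above reports a key that is declared in YarnConfiguration but has no entry in yarn-default.xml. The kind of comparison behind such a report can be sketched as a set difference. This is a minimal illustration, not the actual TestConfigurationFieldsBase implementation: the class name, the method `missingFromDefaults`, and the key `example.already.documented` are hypothetical; only `yarn.nodemanager.health-checker.run-before-startup` comes from the log.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class ConfigFieldCheckSketch {
    /**
     * Returns the keys that are declared in the configuration class
     * but have no entry in the defaults file, in sorted order.
     */
    static Set<String> missingFromDefaults(Set<String> declared, Set<String> defaults) {
        Set<String> missing = new TreeSet<>(declared);
        missing.removeAll(defaults);
        return missing;
    }

    public static void main(String[] args) {
        // Keys the configuration class declares (second one is hypothetical).
        Set<String> declared = new HashSet<>(Arrays.asList(
            "yarn.nodemanager.health-checker.run-before-startup",
            "example.already.documented"));
        // Keys present in the defaults file.
        Set<String> defaults = new HashSet<>(Arrays.asList(
            "example.already.documented"));

        System.out.println(missingFromDefaults(declared, defaults));
    }
}
```

Under this model the failure goes away once the new key also appears in the defaults set, which corresponds to adding the property to yarn-default.xml.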





[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201900#comment-17201900
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 34s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} buf {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} buf was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 18 new or modified test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 26s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 27s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 26s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 54s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 15s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 27m 6s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 13s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 16s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 21s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 28s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 30s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 9s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 20m 9s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 20m 9s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 3m 50s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/194/artifact/out/diff-checkstyle-root.txt{color} | {color:orange} root: The patch generated 3 new + 1258 unchanged - 1 fixed = 1261 total (was 1259) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 22s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 48s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 57s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 20s{color} | {color:green}{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 2s{color} 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201814#comment-17201814
 ] 

Eric Badger commented on YARN-9809:
---

I've attached branch-3.2 patch 008 to address your comments, [~Jim_Brennan]. I 
think I got all of the unit tests to pass. But 
TestCombinedSystemMetricsPublisher, TestSystemMetricsPublisherForV2, 
TestFSSchedulerConfigurationStore, and TestZKConfigurationStore failed for me 
locally on straight up branch-3.2, without the patch.




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201782#comment-17201782
 ] 

Eric Badger commented on YARN-9809:
---

{noformat}
RMNodeImpl#AddNodeTransition#transition
RMNodeStatusEvent rmNodeStatusEvent =
    new RMNodeStatusEvent(nodeId, nodeStatus);

NodeHealthStatus nodeHealthStatus =
    updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

if (nodeHealthStatus.getIsNodeHealthy()) {
{noformat}
bq. Do we run the risk of nodeHealthStatus being null?

[~epayne], nope we should be fine here. {{nodeHealthStatus}} comes from the 
return value of {{updateRMNodeFromStatusEvents}}. The return value of that 
method comes from {{statusEvent.getNodeHealthStatus()}}. But {{statusEvent}} is 
passed into this method via an argument. On the caller side that argument is 
named {{rmNodeStatusEvent}} and it is created a few lines up via the 
RMNodeStatusEvent constructor. The {{nodeStatus}} is set there via the 
constructor and we know it won't be null because we are in the "else" of the 
"if" statement that checked for {{nodeStatus}} being null.

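The call chain described above can be sketched with reduced stand-in classes. This is not the actual RMNodeImpl code: the class bodies, the `transition` helper returning a string, and the choice of RUNNING for the null branch are assumptions made for illustration. Only the names RMNodeStatusEvent, NodeHealthStatus, updateRMNodeFromStatusEvents, getNodeHealthStatus, and getIsNodeHealthy come from the discussion. The sketch also assumes that a non-null node status always carries a health report.

```java
class NullSafetySketch {
    /** Hypothetical stand-in for the NodeHealthStatus record. */
    static final class NodeHealthStatus {
        private final boolean healthy;
        NodeHealthStatus(boolean healthy) { this.healthy = healthy; }
        boolean getIsNodeHealthy() { return healthy; }
    }

    /** Hypothetical stand-in for NodeStatus; assumed to always carry a health report. */
    static final class NodeStatus {
        private final NodeHealthStatus health;
        NodeStatus(NodeHealthStatus health) { this.health = health; }
        NodeHealthStatus getNodeHealthStatus() { return health; }
    }

    /** The event wraps the node status it was constructed with. */
    static final class RMNodeStatusEvent {
        private final NodeStatus nodeStatus;
        RMNodeStatusEvent(NodeStatus nodeStatus) { this.nodeStatus = nodeStatus; }
        NodeHealthStatus getNodeHealthStatus() { return nodeStatus.getNodeHealthStatus(); }
    }

    /** Returns whatever health status the event carries. */
    static NodeHealthStatus updateRMNodeFromStatusEvents(RMNodeStatusEvent statusEvent) {
        return statusEvent.getNodeHealthStatus();
    }

    /** Mirrors the if/else shape: the health status is only read in the non-null branch. */
    static String transition(NodeStatus nodeStatus) {
        if (nodeStatus == null) {
            // Assumed behaviour for this sketch: no status at registration means RUNNING.
            return "RUNNING";
        } else {
            RMNodeStatusEvent rmNodeStatusEvent = new RMNodeStatusEvent(nodeStatus);
            NodeHealthStatus nodeHealthStatus = updateRMNodeFromStatusEvents(rmNodeStatusEvent);
            return nodeHealthStatus.getIsNodeHealthy() ? "RUNNING" : "UNHEALTHY";
        }
    }

    public static void main(String[] args) {
        System.out.println(transition(null));
        System.out.println(transition(new NodeStatus(new NodeHealthStatus(true))));
        System.out.println(transition(new NodeStatus(new NodeHealthStatus(false))));
    }
}
```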



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201136#comment-17201136
 ] 

Jim Brennan commented on YARN-9809:
---

I finished a first pass.  Here are my comments:

NodeHealthScriptRunner
* Need to add code to NodeManager to get the runBeforeStartup conf and pass it 
to the constructor.
* Need to make startup run optional based on runBeforeStartup
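
The two points above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `HealthScriptRunner`, `create`, `start`, and `runOnce` are hypothetical names, a plain `Map` stands in for the Hadoop configuration, and the default of `false` is assumed because the feature is described in this thread as opt-in. The key string is the one reported by the TestYarnConfigurationFields log in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

class RunBeforeStartupSketch {
    static final String NM_HEALTH_CHECK_RUN_BEFORE_STARTUP =
        "yarn.nodemanager.health-checker.run-before-startup";

    /** Hypothetical stand-in for NodeHealthScriptRunner. */
    static final class HealthScriptRunner {
        private final BooleanSupplier script;
        private final boolean runBeforeStartup;
        private boolean healthy = true; // assumed default before the script has ever run
        private int runs = 0;

        HealthScriptRunner(BooleanSupplier script, boolean runBeforeStartup) {
            this.script = script;
            this.runBeforeStartup = runBeforeStartup;
        }

        /** Runs the script once up front only when the flag is set. */
        void start() {
            if (runBeforeStartup) {
                runOnce();
            }
            // A real runner would schedule its periodic timer here.
        }

        void runOnce() {
            runs++;
            healthy = script.getAsBoolean();
        }

        boolean isHealthy() { return healthy; }
        int getRuns() { return runs; }
    }

    /** NodeManager side: read the flag from the conf and pass it to the constructor. */
    static HealthScriptRunner create(Map<String, String> conf, BooleanSupplier script) {
        boolean runBeforeStartup = Boolean.parseBoolean(
            conf.getOrDefault(NM_HEALTH_CHECK_RUN_BEFORE_STARTUP, "false"));
        return new HealthScriptRunner(script, runBeforeStartup);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(NM_HEALTH_CHECK_RUN_BEFORE_STARTUP, "true");
        HealthScriptRunner runner = create(conf, () -> false); // script reports unhealthy
        runner.start();
        System.out.println("healthy at registration: " + runner.isHealthy());
    }
}
```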

RegisterNodeManagerRequest
* See the trunk version of the patch. You should only have to add the new 
parameter to the last newInstance() interface, and have the second to last pass 
null.
* This might reduce the number of tests you need to modify.
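
The overload arrangement suggested above can be sketched like this. The class here is a heavily reduced, hypothetical stand-in: the real RegisterNodeManagerRequest takes many more parameters, and the `String` fields are placeholders chosen for illustration.

```java
class NewInstanceOverloadSketch {
    /** Hypothetical, reduced stand-in for RegisterNodeManagerRequest. */
    static final class RegisterNodeManagerRequest {
        private final String nodeId;
        private final String nmVersion;
        private final String nodeStatus; // null when the caller supplies no status

        private RegisterNodeManagerRequest(String nodeId, String nmVersion, String nodeStatus) {
            this.nodeId = nodeId;
            this.nmVersion = nmVersion;
            this.nodeStatus = nodeStatus;
        }

        /** Second-to-last overload: signature unchanged, passes null for the new parameter. */
        static RegisterNodeManagerRequest newInstance(String nodeId, String nmVersion) {
            return newInstance(nodeId, nmVersion, null);
        }

        /** Last overload: the only one that gains the new parameter. */
        static RegisterNodeManagerRequest newInstance(String nodeId, String nmVersion,
                                                      String nodeStatus) {
            return new RegisterNodeManagerRequest(nodeId, nmVersion, nodeStatus);
        }

        String getNodeId() { return nodeId; }
        String getNmVersion() { return nmVersion; }
        String getNodeStatus() { return nodeStatus; }
    }

    public static void main(String[] args) {
        // Existing call sites keep compiling and implicitly send no status.
        RegisterNodeManagerRequest legacy =
            RegisterNodeManagerRequest.newInstance("node-1", "3.2");
        // Only new call sites pass the extra argument.
        RegisterNodeManagerRequest withStatus =
            RegisterNodeManagerRequest.newInstance("node-1", "3.2", "UNHEALTHY");
        System.out.println(legacy.getNodeStatus() + " / " + withStatus.getNodeStatus());
    }
}
```

Because the older signature is untouched, callers and tests that use it need no modification, which is the reduction in test churn the comment refers to.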

RMNodeImpl
* addNodeTransition - I think this line should be removed:
{noformat}
// Increment activeNodes explicitly because this is a new node.
ClusterMetrics.getMetrics().incrNumActiveNodes();
{noformat}
* updateMetricsForRejoinedNode - I think we need to remove 
metrics.incrNumActiveNodes();

TestRMNodeTransitions
* new testAddUnhealthyNode() test is not here

These should not be needed if you fix constructors for 
RegisterNodeManagerRequest
* TestProtocolRecords
* TestRegisterNodeManagerRequest
* TestResourceTrackerOnHA
* TestYarnServerApiClasses





[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201116#comment-17201116
 ] 

Eric Badger commented on YARN-9809:
---

Thanks for the initial reviews, [~epayne] and [~Jim_Brennan]! I will put up an 
updated patch soon with changes related to your comments. I also noticed some 
other issues that are manifesting as the unit test failures. So I will fix 
those as well.




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201098#comment-17201098
 ] 

Jim Brennan commented on YARN-9809:
---

Thanks [~ebadger] for putting up a branch-3.2 patch!  I am still reading, but 
wanted to make this initial comment:  This patch does not include the config 
parameter {{NM_HEALTH_CHECK_RUN_BEFORE_STARTUP}} to make this an opt-in feature.





[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201061#comment-17201061
 ] 

Eric Payne commented on YARN-9809:
--

Thanks a lot [~ebadger] for putting up the 3.2 backport patch. I'm still going 
through it, but I had one question after my first pass:
{code:java|title=RMNodeImpl#AddNodeTransition#transition}
RMNodeStatusEvent rmNodeStatusEvent =
    new RMNodeStatusEvent(nodeId, nodeStatus);

NodeHealthStatus nodeHealthStatus =
    updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

if (nodeHealthStatus.getIsNodeHealthy()) {
{code}

Do we run the risk of {{nodeHealthStatus}} being null?




[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-21 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199792#comment-17199792
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 22m 27s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} buf {color} | {color:blue} 0m 0s{color} | {color:blue} buf was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 21 new or modified test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 20s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 59s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 19m 36s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 14s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 11s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 11s{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 53s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
55s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
19s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 14m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 
24s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
2m 53s{color} | {color:orange} root: The patch generated 4 new + 1035 unchanged 
- 1 fixed = 1039 total (was 1036) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 25s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  7m 
37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  9m 
14s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
37s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
13s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 18s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m  
8s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-21 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199707#comment-17199707
 ] 

Eric Badger commented on YARN-9809:
---

[~epayne], [~Jim_Brennan], sorry for the delay. I have put up a patch for 
branch-3.2. However, I think this needs another round of review because the 
diff on the cherry-pick was quite large and I had to redo a lot of it by hand. 
So in a lot of ways, this is a completely new patch. I think I got all of the 
unit tests that would've failed, but we'll see what HadoopQA says.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.
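
The problem in the issue description can be pictured with a toy model. This is only a sketch under stated assumptions: {{ToyNode}}, {{registerWithoutHealth}}, {{registerWithHealth}}, {{schedule}} and {{heartbeat}} are hypothetical names invented for the example and do not correspond to the real RM or NM code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behavior described in the issue; all names are hypothetical.
class RegistrationSketch {
  enum NodeState { RUNNING, UNHEALTHY }

  // Minimal RM-side record of a node and the containers placed on it.
  static final class ToyNode {
    NodeState state;
    final List<String> containers = new ArrayList<>();

    ToyNode(NodeState state) {
      this.state = state;
    }
  }

  // Described problem: health is unknown at registration, so the node starts RUNNING.
  static ToyNode registerWithoutHealth() {
    return new ToyNode(NodeState.RUNNING);
  }

  // Requested behavior: the registration itself carries the health status.
  static ToyNode registerWithHealth(boolean healthy) {
    return new ToyNode(healthy ? NodeState.RUNNING : NodeState.UNHEALTHY);
  }

  // Places containers only on RUNNING nodes; returns how many were placed.
  static int schedule(ToyNode node, int requested) {
    if (node.state != NodeState.RUNNING) {
      return 0;
    }
    for (int i = 0; i < requested; i++) {
      node.containers.add("container_" + i);
    }
    return requested;
  }

  // Applies a heartbeat; an unhealthy report kills all containers on the node.
  // Returns how many containers were killed.
  static int heartbeat(ToyNode node, boolean healthy) {
    if (healthy) {
      node.state = NodeState.RUNNING;
      return 0;
    }
    node.state = NodeState.UNHEALTHY;
    int killed = node.containers.size();
    node.containers.clear();
    return killed;
  }

  public static void main(String[] args) {
    ToyNode before = registerWithoutHealth();
    schedule(before, 5);
    System.out.println("killed without health at registration: "
        + heartbeat(before, false));

    ToyNode after = registerWithHealth(false);
    schedule(after, 5);
    System.out.println("killed with health at registration: "
        + heartbeat(after, false));
  }
}
```

In the first case the unhealthy node receives work that is then thrown away at the first heartbeat; in the second it never receives the work at all.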






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190724#comment-17190724
 ] 

Jim Brennan commented on YARN-9809:
---

No objection to backporting.


> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-03 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190414#comment-17190414
 ] 

Eric Payne commented on YARN-9809:
--

[~ebadger], this doesn't backport cleanly to 3.2. Would you mind taking a look 
when you have a chance?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-03 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190392#comment-17190392
 ] 

Eric Payne commented on YARN-9809:
--

Unless there are objections, I would like to backport this to 3.1.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148914#comment-17148914
 ] 

Hudson commented on YARN-9809:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18394 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18394/])
YARN-9809. Added node manager health status to resource manager (eyang: rev e8dc862d3856e9eaea124c625dade36f1dd53fe2)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/health/TimedHealthReporterService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/resourcetracker/TestNMReconnect.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestEventFlow.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerHealth.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/health/NodeHealthScriptRunner.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodes.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/scheduler/TestContainerSchedulerQueuing.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java
* (edit) 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-30 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148906#comment-17148906
 ] 

Eric Badger commented on YARN-9809:
---

Thanks, [~eyang] for the review and commit and [~Jim_Brennan] for the review!

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-30 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148829#comment-17148829
 ] 

Eric Badger commented on YARN-9809:
---

Thanks for the review, [~eyang]! Are you planning on committing this or would 
you like me to?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-29 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148066#comment-17148066
 ] 

Eric Yang commented on YARN-9809:
-

+1 for patch 007.  Tested both healthy and unhealthy health check scripts in my 
limited 1 node environment.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-26 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146522#comment-17146522
 ] 

Eric Badger commented on YARN-9809:
---

Thanks, [~Jim_Brennan]! [~eyang], would you take another look?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-26 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146402#comment-17146402
 ] 

Jim Brennan commented on YARN-9809:
---

Thanks for the update [~ebadger]!

I am +1 (non-binding) on patch 007.

 

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145981#comment-17145981
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m 36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue}  0m  0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  3m 27s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m  0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m  3s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 53s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 54s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  8m 36s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 57s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  7m 30s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  7m 30s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 52s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 1231 unchanged - 3 fixed = 1231 total (was 1234) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 12s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 40s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common generated 11 new + 89 unchanged - 11 fixed = 100 total (was 100) {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  8m 56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  8s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 16s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 54s{color} | {color:green} hadoop-yarn-server-common in the patch 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145904#comment-17145904
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue}  0m  0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  1s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 27s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m  3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 25m  3s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 43s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 21s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 10s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 25s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 10m 25s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 56s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 1232 unchanged - 3 fixed = 1232 total (was 1235) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m  9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  0s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 18s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 54s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 27s{color} | 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145893#comment-17145893
 ] 

Eric Badger commented on YARN-9809:
---

Good catch, [~Jim_Brennan]. {{updateMetricsForRejoinedNode()}} is only called 
in one other place and I don't want to add the node and then remove it again. 
So I removed the increment from {{updateMetricsForRejoinedNode()}} and 
explicitly added it just before the other place where 
{{updateMetricsForRejoinedNode()}} is called.







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145869#comment-17145869
 ] 

Jim Brennan commented on YARN-9809:
---

Thanks for the updates [~ebadger]! I have one comment on the new patch:

RMNodeImpl
* I think there's a bug from moving the call to 
{{ClusterMetrics.getMetrics().incrNumActiveNodes()}}. If previousRMNode != 
null (in the first check), we call {{rmNode.updateMetricsForRejoinedNode()}}, 
which decrements the counter for the previous state and increments num active 
nodes. With your change, we now increment active nodes again when we call 
reportNodeRunning.
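
The double count described above can be illustrated with a toy model. This is only a sketch: {{ActiveNodeCounter}} and {{RejoinSketch}} are hypothetical stand-ins rather than the real ClusterMetrics or RMNodeImpl, and the previous state is assumed to be UNHEALTHY purely for illustration.

```java
// Toy counters; a stand-in for the cluster metrics, not the real class.
class ActiveNodeCounter {
    private int active;
    private int unhealthy;

    void incrActive() { active++; }
    void incrUnhealthy() { unhealthy++; }
    void decrUnhealthy() { unhealthy--; }
    int active() { return active; }
    int unhealthy() { return unhealthy; }
}

class RejoinSketch {
    // Models the described behaviour of updateMetricsForRejoinedNode():
    // drop the counter for the previous state, then count the node as active.
    static void updateMetricsForRejoinedNode(ActiveNodeCounter m) {
        m.decrUnhealthy();
        m.incrActive();
    }

    // Models reportNodeRunning after the increment was moved into it.
    static void reportNodeRunning(ActiveNodeCounter m) {
        m.incrActive();
    }

    // A single node that was UNHEALTHY (assumed) rejoins and is reported running.
    static int activeAfterRejoin() {
        ActiveNodeCounter m = new ActiveNodeCounter();
        m.incrUnhealthy();               // the node's previous state
        updateMetricsForRejoinedNode(m); // active is now 1
        reportNodeRunning(m);            // active is now 2 for one real node
        return m.active();
    }
}
```

In this model {{RejoinSketch.activeAfterRejoin()}} returns 2 even though only one node exists, which is the over-count being pointed out.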







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145717#comment-17145717
 ] 

Eric Badger commented on YARN-9809:
---

The TestFairScheduler and TestFairSchedulerPreemption test failures are 
unrelated to this JIRA as they have also been reported in 
https://issues.apache.org/jira/browse/YARN-10329







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145711#comment-17145711
 ] 

Eric Badger commented on YARN-9809:
---

Patch 006 moves {{ClusterMetrics.getMetrics().incrNumActiveNodes();}} into 
{{reportNodeRunning}} inside of the addNodeTransition. This fixes the failing 
unit test and prevents a scenario where we add an unhealthy node as RUNNING and 
then quickly switch it to UNHEALTHY. This way we go straight to UNHEALTHY.
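
A minimal sketch of that idea, under the assumption that the active-node increment lives only on the healthy path. {{AddNodeSketch}} and its fields are hypothetical and do not reproduce the actual RMNodeImpl transition.

```java
// Hypothetical model of choosing the initial state at registration time.
class AddNodeSketch {
    enum NodeState { RUNNING, UNHEALTHY }

    int numActiveNodes;
    int numUnhealthyNodes;

    // The active-node counter is only touched on the healthy path.
    private void reportNodeRunning() {
        numActiveNodes++;
    }

    // Unhealthy nodes are counted separately and never pass through RUNNING.
    private void reportNodeUnusable() {
        numUnhealthyNodes++;
    }

    NodeState addNode(boolean healthyAtRegistration) {
        if (healthyAtRegistration) {
            reportNodeRunning();
            return NodeState.RUNNING;
        }
        reportNodeUnusable();
        return NodeState.UNHEALTHY;
    }
}
```

Registering an unhealthy node in this model leaves {{numActiveNodes}} untouched, so there is no RUNNING phase to undo afterwards.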







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144632#comment-17144632
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 16s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | dupname | 0m 1s | No case conflicting files found. |
| 0 | prototool | 0m 0s | prototool was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 20 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 2s | Maven dependency ordering for branch |
| +1 | mvninstall | 22m 23s | trunk passed |
| +1 | compile | 8m 35s | trunk passed |
| +1 | checkstyle | 1m 45s | trunk passed |
| +1 | mvnsite | 4m 9s | trunk passed |
| +1 | shadedclient | 21m 46s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 3m 25s | trunk passed |
| 0 | spotbugs | 1m 49s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 8m 16s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 22s | Maven dependency ordering for patch |
| +1 | mvninstall | 3m 8s | the patch passed |
| +1 | compile | 8m 3s | the patch passed |
| +1 | cc | 8m 3s | the patch passed |
| -1 | javac | 8m 3s | hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) |
| -0 | checkstyle | 1m 41s | hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 1232 unchanged - 3 fixed = 1234 total (was 1235) |
| +1 | mvnsite | 3m 53s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 15m 43s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 3m 11s | the patch passed |
| +1 | findbugs | 8m 50s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 59s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 4m 1s | hadoop-yarn-common in the patch passed. |
| +1 | unit | 2m 40s | hadoop-yarn-server-common in the patch passed. |
| +1 | unit | 22m 35s |

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144544#comment-17144544
 ] 

Eric Badger commented on YARN-9809:
---

Thanks for the review, [~Jim_Brennan]! I've uploaded patch 005 to fix the 
things you mentioned in your comments







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140841#comment-17140841
 ] 

Jim Brennan commented on YARN-9809:
---

Thanks for the patch [~ebadger]!  Overall I think the design and code look good.
Here are some comments:

MockNM
- line 196 - Isn't this loop that removes completedContainers a no-op?
{noformat}
ArrayList<ContainerId> completedContainers = new ArrayList<ContainerId>();
status.setContainersStatuses(
    new ArrayList<ContainerStatus>(containerStats.values()));
for (ContainerId cid : completedContainers) {
  containerStats.remove(cid);
}
{noformat}

MockRM
- The code below is repeated in a lot of tests: TestAbstractYarnScheduler, 
TestCapacityScheduler, TestFairScheduler, TestFifoScheduler, TestNMExpiry, 
TestNMReconnect, TestResourceManager, TestRMAppLogAggregationStatus, 
TestRMNodeTransitions, TestRMWebServicesNodes, and TestSchedulerHealth. Maybe 
we could add a function somewhere that does this so we can just pass 
getMockNodeStatus() instead?
{noformat}
NodeStatus mockNodeStatus = mock(NodeStatus.class);
NodeHealthStatus mockNodeHealthStatus = mock(NodeHealthStatus.class);
when(mockNodeStatus.getNodeHealthStatus()).thenReturn(mockNodeHealthStatus);
when(mockNodeHealthStatus.getIsNodeHealthy()).thenReturn(true);
{noformat}

RMAppManager
- This looks like an accidental edit:
{noformat}
// Escape YarnServerCommonServiceProtossequences
{noformat}

RMNodeImpl
- line 894 Don't we have to deal with the possibility that nodeStatus is null 
here?  That seems like a possibility.  I think a null nodeStatus should be 
treated as healthy.  The RegisterNodeManagerRequest constructors that pass 
null are what made me think this is necessary.
{noformat}
  NodeStatus nodeStatus =
  startEvent.getNodeStatus();
  RMNodeStatusEvent rmNodeStatusEvent =
  new RMNodeStatusEvent(nodeId, nodeStatus);

  NodeHealthStatus nodeHealthStatus =
  updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

  NodeState nodeState = null;
  if (nodeHealthStatus.getIsNodeHealthy()) {
{noformat}
- In the case where the node is unhealthy, can we just call 
reportNodeUnusable() 
instead of
{noformat}
rmNode.context.getDispatcher().getEventHandler().handle(
new NodesListManagerEvent(
NodesListManagerEventType.NODE_UNUSABLE, rmNode));
//Update the metrics
rmNode.updateMetricsForDeactivatedNode(NodeState.RUNNING,
NodeState.UNHEALTHY);
{noformat}
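
For the null nodeStatus concern raised under RMNodeImpl, a null-tolerant check could look roughly like the sketch below. {{Status}} and {{HealthStatus}} are simplified, hypothetical stand-ins for NodeStatus and NodeHealthStatus, and treating null as healthy is the policy suggested in this comment rather than confirmed behavior of the patch.

```java
// Sketch only: simplified stand-ins, not the real YARN record classes.
class NullStatusSketch {
    static final class HealthStatus {
        final boolean healthy;
        HealthStatus(boolean healthy) { this.healthy = healthy; }
    }

    static final class Status {
        final HealthStatus health;
        Status(HealthStatus health) { this.health = health; }
    }

    // A missing status (for example from a caller that passes null) is
    // treated as healthy; otherwise the reported flag decides.
    static boolean isHealthyAtRegistration(Status status) {
        if (status == null || status.health == null) {
            return true;
        }
        return status.health.healthy;
    }
}
```

With this shape, callers that still pass null keep the old "assume healthy" behavior, while callers that supply a status are honored.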

TestRMNodeTransitions
- Maybe add a testAddUnhealthy here?

TimedHealthReporterService
- Do we need to be concerned about someone who might have their own 
implementation of TimedHealthReporterService?  Should we maintain a constructor 
that takes two args and passes null for runBeforeStartup?








[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139516#comment-17139516
 ] 

Jim Brennan commented on YARN-9809:
---

I have started reviewing the patch, but I will need more time.  I hope to 
finish today or tomorrow.








[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-18 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139511#comment-17139511
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] [~Jim_Brennan] I agree that health check script handling is 
separate from reporting a health status at registration.  +1 on patch 004.







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139451#comment-17139451
 ] 

Jim Brennan commented on YARN-9809:
---

[~eyang], [~ebadger] changing the behavior of health-check scripts seems pretty 
dangerous.  We looked into this issue a few years ago, because we had some 
cases where the health-check scripts were not installed properly, and some bad 
nodes were erroneously reporting healthy status.

Rather than try to change the contract for how health-check scripts behave, 
which has been around for a very long time, we instead added a wrapper script 
that we ship with hadoop.  The wrapper checks that the real health-check script 
exists and is executable, and if it's not, it prints an "ERROR" message so the 
NM will mark the node unhealthy.  If the health-check script is good, we just 
exec it.

I agree that changing the handling of health check script output/return value 
is beyond the scope of this Jira.








[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138899#comment-17138899
 ] 

Eric Badger commented on YARN-9809:
---

I can see pros and cons to both approaches. On the one hand, if the health 
check script fails to execute properly, that's not good and could imply 
something bad. But health check scripts are pretty dangerous since they can 
take out an entire cluster if they're written improperly. So if someone updates 
the script and all of a sudden the script errors out, the whole cluster is 
unhealthy. Or the health check script could rely on querying a service and that 
service times out. The node is healthy, but the health check script returned an 
error. Unless you are parsing for specific error codes, you can no longer 
differentiate between the health check script failing internally and the health 
check script successfully reporting that the node is unhealthy. 

Regardless of this discussion though, this is outside of the scope of this 
JIRA. That's an issue with how the health check script is handled, while this 
JIRA is just about providing a health status at NM startup.







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138894#comment-17138894
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] Sorry, my statement was not clear.  If the script name is incorrect 
(so the resulting exit code is non-zero), or the script itself exits with a 
non-zero code, the health check still reports the node as healthy.  I think 
those conditions must be considered unhealthy, in case the check script does 
not have its proper prerequisites.  The errors can be caught.  Is this something 
that we can fix to make this more user friendly?







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138892#comment-17138892
 ] 

Eric Badger commented on YARN-9809:
---

{noformat:title=NodeHealthScriptRunner.newInstance()}
if (!shouldRun(scriptName, nodeHealthScript)) {
  return null;
}
{noformat}

{noformat:title=NodeHealthScriptRunner.shouldRun()}
static boolean shouldRun(String script, String healthScript) {
  if (healthScript == null || healthScript.trim().isEmpty()) {
    LOG.info("Missing location for the node health check script \"{}\".",
        script);
    return false;
  }
{noformat}

If the health check script doesn't exist, then {{shouldRun}} will return false 
and {{newInstance}} will return null. This will cause the health reporter to 
not be added as a service. So at the end of the day, your statement is correct. 
If the health check script doesn't exist, the node will report as healthy.
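
The chain above can be sketched as follows. This is an illustration only: it models just the null/empty check shown in the snippet, and {{ScriptReporterSketch}}, {{nodeHealthy}} and the use of {{BooleanSupplier}} as a reporter are assumptions made for the example, not the NodeManager's actual wiring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical model: a reporter is a supplier that says "healthy" or not.
class ScriptReporterSketch {
    // Mirrors only the null/empty check from the quoted shouldRun() snippet.
    static boolean shouldRun(String healthScript) {
        return healthScript != null && !healthScript.trim().isEmpty();
    }

    // Returns null when no script is configured, as described above.
    static BooleanSupplier newInstance(String healthScript,
                                       boolean scriptSaysHealthy) {
        if (!shouldRun(healthScript)) {
            return null;
        }
        return () -> scriptSaysHealthy;
    }

    // A null reporter is never registered, so nothing can flag the node.
    static boolean nodeHealthy(String healthScript, boolean scriptSaysHealthy) {
        List<BooleanSupplier> reporters = new ArrayList<>();
        BooleanSupplier reporter = newInstance(healthScript, scriptSaysHealthy);
        if (reporter != null) {
            reporters.add(reporter);
        }
        // allMatch on an empty list is true, i.e. healthy by default.
        return reporters.stream().allMatch(BooleanSupplier::getAsBoolean);
    }
}
```

The point of the sketch is the default: with no reporter registered, the health question is answered "healthy" without any script ever having run.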







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138853#comment-17138853
 ] 

Eric Yang commented on YARN-9809:
-

[~Jim_Brennan] Thank you for the instruction.  I updated my check script 
accordingly to:

{code}
#!/bin/bash
echo "ERROR test"
{code}

This works.  The script must also return a 0 exit code for this to work; 
otherwise, the node is reported as healthy.  This implies that if the health 
check script doesn't exist, the node is reported as healthy.  Is this right?








[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138833#comment-17138833
 ] 

Jim Brennan commented on YARN-9809:
---

[~eyang] I believe the health check script output must contain a line that 
begins with the string "ERROR" for the node to be marked as unhealthy.  The 
exit code does not have any effect.  
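
A minimal sketch of the rule as stated in this comment, assuming the check is a simple prefix match on each output line. {{ErrorLineSketch}} is a hypothetical name and this is not the NodeManager's actual parsing code.

```java
// Models the stated rule: a line beginning with "ERROR" marks the node
// unhealthy, and the exit code is deliberately ignored.
class ErrorLineSketch {
    static boolean isUnhealthy(String scriptOutput, int exitCode) {
        for (String line : scriptOutput.split("\n")) {
            if (line.startsWith("ERROR")) {
                return true;
            }
        }
        // exitCode is accepted only to show that it plays no part here.
        return false;
    }
}
```

Under this model, a script that prints nothing starting with "ERROR" and exits with 1 still leaves the node healthy, while one that prints "ERROR test" marks it unhealthy.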







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138685#comment-17138685
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] Thank you for the patch.  The patch looks very close to the final 
product.  I have confirmed that the test case failure doesn't happen if there 
is a sufficient amount of RAM on the testing node.  I also validated that the 
new node manager can work with an unpatched resource manager.  However, I could 
not get the health check script to fail in a way that causes the node to be 
registered as unhealthy.

Here is my check script:
{code}
#!/bin/bash
echo "i am here" > /tmp/hello
exit 1
{code}

It would be nice to have a verbose message showing the exit code of the health 
check script in the node manager log file.  The script is executed, but the 
node shows as healthy.  What am I doing wrong?







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137934#comment-17137934
 ] 

Eric Badger commented on YARN-9809:
---

{noformat}
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesSchedulerActivities
hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
{noformat}
Neither of these tests fails for me locally, and both are unrelated to the 
changes made in patch 004. 

Both the javac and the javadoc errors are coming from generated protobuf java 
files. I don't know how to get rid of these errors, but they aren't introducing 
any warnings that don't already exist. I think they're fine. The generation of 
the java files is the issue here.

[~Jim_Brennan], [~ccondit], [~eyang], could you guys review patch 004?







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136194#comment-17136194
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 5s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| 0 | prototool | 0m 0s | prototool was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 20 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 16s | Maven dependency ordering for branch |
| +1 | mvninstall | 21m 4s | trunk passed |
| +1 | compile | 8m 39s | trunk passed |
| +1 | checkstyle | 1m 52s | trunk passed |
| +1 | mvnsite | 4m 33s | trunk passed |
| +1 | shadedclient | 20m 43s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 3m 45s | trunk passed |
| 0 | spotbugs | 2m 10s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 9m 8s | trunk passed |
|| || || || Patch Compile Tests ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
28s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  9m 
47s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  9m 47s{color} 
| {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 
0 fixed = 335 total (was 334) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
58s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 
0 new + 1255 unchanged - 3 fixed = 1255 total (was 1258) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 18s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
39s{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common 
generated 1 new + 99 unchanged - 1 fixed = 100 total (was 100) {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 
33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
13s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 
34s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
8s{color} | {color:green} hadoop-yarn-server-common in the patch 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136095#comment-17136095
 ] 

Eric Badger commented on YARN-9809:
---

Patch 004 fixes checkstyle. There is still the javac error about PARSER being 
deprecated, but it comes from a generated proto file, so I'm not sure how to 
get rid of it. PARSER is used in many other places within the same generated 
file.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-12 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134646#comment-17134646
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 24m 34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 17 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 21s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 52s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 31s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 53s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 10s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 8m 23s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 8m 23s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 39s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 4 new + 1153 unchanged - 2 fixed = 1157 total (was 1155) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 4s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 58s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 3s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 40s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 22m 4s{color} | {color:red} 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-08 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128817#comment-17128817
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 31m 30s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 6 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 16s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 24m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 18s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 5s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 42s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 53s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 10m 53s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 52s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 484 unchanged - 1 fixed = 484 total (was 485) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 44s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 11s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 25s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 59s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 14m 34s{color} | {color:red} 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-08 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128686#comment-17128686
 ] 

Eric Badger commented on YARN-9809:
---

Attaching patch 002 to address the unit test failures.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-06 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127215#comment-17127215
 ] 

Hadoop QA commented on YARN-9809:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 29s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 21s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 49s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 20s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 8m 4s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 8m 4s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 34s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 5 new + 474 unchanged - 1 fixed = 479 total (was 475) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 40s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 59s{color} | {color:red} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 39s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 13m 11s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}345m 10s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color} 

[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-05 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127025#comment-17127025
 ] 

Eric Badger commented on YARN-9809:
---

Patch 001 adds the feature but makes it opt-in via the config 
{{yarn.nodemanager.health-checker.run-before-startup}}. I didn't put in the 
retries flag for shutting down the NM if there are a certain number of 
failures. I can do that in a subsequent patch if you'd like. But I tested this 
patch out and it seems to work.
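
As a rough illustration of the opt-in behaviour described above, the sketch 
below gates a synchronous health check on the config key. It is not code from 
the patch: StartupHealthSketch, Registration, and register are hypothetical 
names, a plain Map stands in for the Hadoop Configuration, and the default of 
"false" is an assumption based on the feature being opt-in.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Illustrative stand-in only; not the NodeManager classes touched by the patch. */
final class StartupHealthSketch {
  // Config key quoted in the comment above.
  static final String RUN_BEFORE_STARTUP =
      "yarn.nodemanager.health-checker.run-before-startup";

  /** Hypothetical registration payload: only the fields needed for the example. */
  static final class Registration {
    final boolean healthy;
    final boolean checkedBeforeStartup;

    Registration(boolean healthy, boolean checkedBeforeStartup) {
      this.healthy = healthy;
      this.checkedBeforeStartup = checkedBeforeStartup;
    }
  }

  /**
   * Builds the registration. With the flag off (assumed default), the node
   * registers as healthy without running the check, which mirrors the
   * pre-patch behaviour. With the flag on, the check runs synchronously and
   * its result is sent along with the registration.
   */
  static Registration register(Map<String, String> conf, BooleanSupplier healthCheck) {
    boolean optIn = Boolean.parseBoolean(conf.getOrDefault(RUN_BEFORE_STARTUP, "false"));
    if (!optIn) {
      return new Registration(true, false);
    }
    return new Registration(healthCheck.getAsBoolean(), true);
  }
}
```

With the flag enabled and a failing check, register returns a Registration 
whose healthy field is false, so the RM could refuse to schedule containers on 
the node from the start rather than after the first heartbeat.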

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-05-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110361#comment-17110361
 ] 

Jim Brennan commented on YARN-9809:
---

[~ccondit] I agree that a config to allow one to opt-in to this feature is a 
good idea.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-05-18 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110356#comment-17110356
 ] 

Craig Condit commented on YARN-9809:


Since health check scripts are by nature different for every deployment, it 
seems that neither the current behavior nor what is proposed makes sense in all 
cases. However, I believe there can be some middle ground. I propose we keep 
the existing behavior as default to avoid causing pain for existing users, but 
allow a configuration to opt-in to a single synchronous execution of a health 
check on startup before node check-in (controlled via a new 
*{{yarn.nodemanager.health-checker.preflight.enabled}}* boolean configuration). 
It may also be desirable to kill the NM upon repeated failures of this 
preflight check. We could add a new config 
*{{yarn.nodemanager.health-checker.preflight.retries}}* to control the number 
of retries before aborting the NM (or -1 for infinite).
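
To make the proposed retry semantics concrete, here is a minimal sketch of 
such a preflight loop. It is an assumption-laden illustration, not an existing 
NodeManager API: PreflightSketch and runPreflight are hypothetical names, 
"retries" is read as the number of extra attempts after the first run, and no 
delay between attempts is modelled.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating the proposed preflight.retries semantics. */
final class PreflightSketch {
  /**
   * Runs the health check once, then up to {@code retries} more times.
   * A value of -1 retries forever. Returns true as soon as one run reports
   * healthy, or false once the retries are exhausted (the caller would then
   * abort the NM).
   */
  static boolean runPreflight(BooleanSupplier healthCheck, int retries) {
    int retriesUsed = 0;
    while (true) {
      if (healthCheck.getAsBoolean()) {
        return true;
      }
      if (retries >= 0 && retriesUsed >= retries) {
        return false;
      }
      retriesUsed++;
    }
  }
}
```

Under this reading, retries = 0 means a single synchronous run, and a real 
implementation would presumably sleep between attempts so that a persistently 
failing script does not spin.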

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-05-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108617#comment-17108617
 ] 

Eric Yang commented on YARN-9809:
-

[~Jim_Brennan] This feature is a great addition that makes admin tasks easier 
for large-scale clusters. What latency are we talking about for the 
health-check script? If it is a few seconds or less, I agree that the 
difference in startup time is marginal and the potential benefit is great.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-05-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108576#comment-17108576
 ] 

Jim Brennan commented on YARN-9809:
---

I would like to revive this discussion.  We have this implemented internally.   
Our health-check script runs very quickly, so the impact on the time it takes 
to register with the RM is minimal (not really noticeable in our case).   Our 
health-check script does a number of checks to validate the health of the host 
the NM is running on.  We don't have any checks directly related to 
success/failure of containers launching since the last check, but even if we 
did, that particular check just wouldn't find anything if no containers have 
been launched yet.

The cases that we were trying to address with this change involve hardware or 
OS issues with the node that may not prevent a container from launching, but 
are serious enough to mark the node as unhealthy (memory/disk errors, etc.). 
We have seen this during a rolling upgrade (RU). Nodes that had previously 
been marked as unhealthy would be brought up as part of the RU, and those 
nodes would start running containers only to be marked unhealthy 10 minutes 
later when the health-check script ran. This caused a lot of killed task 
attempts. With large clusters there can be hundreds of unhealthy nodes, so 
there can be a lot of failed task attempts.

It seems the main question that [~eyang] is raising is whether we should allow 
a synchronous call to run the health-check script during nodemanager 
startup/registration. I agree that this can introduce a slowdown if the 
health-check script is slow. In our case, the delay is not noticeable, and we 
think it is worth it to prevent the false start. What do others think?

cc: [~ebadger], [~eyang], [~ccondit-target], [~shaneku...@gmail.com]

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.






[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2019-09-04 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922542#comment-16922542
 ] 

Eric Badger commented on YARN-9809:
---

bq. Although it is good to have a way to prevent scheduling containers to a 
node manager that is going through registration process to save network round 
trips and compute resources, the existing async design allows the node to show 
up in Resource Manager as quickly as possible to improve system admin user 
experience.

But if that node is bad, then registering with the RM just adds unnecessary 
work. The NM health check script can check for many things that are known 
without a container being run. For example, Docker might not be installed, or 
nscd might not be running (causing a user lookup for every new container). 
These could be reasons for the node to declare itself unhealthy, depending on 
the specific health check script. If we register with the RM and only declare 
the node unhealthy afterwards, then we have to kill every container that was 
scheduled in the period between registration and the first heartbeat.







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2019-09-03 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921761#comment-16921761
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] LocalDirsHandlerService checkDir is a timer task. It is unlikely 
that the scheduled task completes before the registration call happens. 
Although it is good to have a way to prevent scheduling containers to a node 
manager that is going through registration process to save network round trips 
and compute resources, the existing async design allows the node to show up in 
Resource Manager as quickly as possible to improve system admin user 
experience. I think there are merits in both approaches, but they seem 
mutually exclusive. Please shed light on your plan to keep the existing 
responsiveness of registration and prevent containers leaking to a bad node. 
Thanks







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2019-09-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921754#comment-16921754
 ] 

Eric Badger commented on YARN-9809:
---

bq. It is unlikely to determine unhealthy status until at least one container 
tried to run on the given node manager.
This scenario can happen when none of the local dirs are available due to bad 
disks or for any other arbitrary reason in the health check script. For 
example, we have an optional offline file that can be set on the node to mark 
it as unhealthy. 

bq. How does health status field in registration heartbeat help?
If the node can register as unhealthy then it won't ever have containers 
assigned to it. There is currently a period of time between registration and 
the first node heartbeat where the node appears to be healthy.

bq. If containers are getting killed, they are supposed to schedule elsewhere. 
Do you observe any problem in rescheduling containers?
Yes, the containers will get rescheduled, but it is still wasteful to schedule 
containers to a node if we are just going to kill them shortly after. If this 
happens over many nodes at once, then there are a lot of unnecessary container 
kills happening, which we can avoid by sending the health status of the node 
with the initial RM registration.
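
To make the window between registration and first heartbeat concrete, here is 
a toy model of a single unhealthy node. Everything in it is an assumption for 
illustration: the class name, the notion of scheduling "ticks", and the 
numbers passed in main are made up and are not taken from YARN or from any 
measurement.

```java
// Toy model with made-up inputs; not derived from YARN code or measurements.
class HeartbeatWindowSketch {

    /**
     * Returns how many containers get killed on one node that is actually
     * unhealthy.
     *
     * @param reportHealthAtRegistration whether the RM learns the node's
     *        health from the registration request itself
     * @param ticksUntilFirstHeartbeat scheduling rounds that pass between
     *        registration and the first heartbeat
     * @param containersPerTick containers the RM assigns per round to a node
     *        it believes is healthy
     */
    static int killedContainers(boolean reportHealthAtRegistration,
                                int ticksUntilFirstHeartbeat,
                                int containersPerTick) {
        // Without a health status at registration, the RM assumes the node
        // is healthy until the first heartbeat says otherwise.
        boolean rmBelievesHealthy = !reportHealthAtRegistration;

        int scheduled = 0;
        for (int tick = 0; tick < ticksUntilFirstHeartbeat; tick++) {
            if (rmBelievesHealthy) {
                scheduled += containersPerTick;
            }
        }

        // First heartbeat: the RM learns the node is unhealthy and kills
        // everything it scheduled there in the meantime.
        return scheduled;
    }

    public static void main(String[] args) {
        System.out.println("killed without status at registration: "
            + killedContainers(false, 3, 4));
        System.out.println("killed with status at registration: "
            + killedContainers(true, 3, 4));
    }
}
```

In this model the waste grows with both the length of the window and the 
number of unhealthy nodes registering at once, and drops to zero when the 
status is known at registration.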







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2019-09-03 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921751#comment-16921751
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] It is unlikely that an unhealthy status can be determined until at 
least one container has tried to run on the given node manager. How does a 
health status field in the registration heartbeat help? If containers are 
getting killed, they are supposed to be scheduled elsewhere. Do you observe 
any problem in rescheduling containers?







[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2019-09-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921741#comment-16921741
 ] 

Eric Badger commented on YARN-9809:
---

I propose adding a health status field to the NM-RM registration request. That 
way the NM can still register to the RM, but without getting a bunch of 
containers that will immediately get killed after the first heartbeat.
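
A minimal sketch of the idea, assuming a registration request that carries a 
boolean health flag and a free-text report. The classes below 
(RegistrationRequest, ToyResourceManager, NodeState) are hypothetical 
stand-ins, not the real YARN protocol records, and the actual patch may look 
quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these are not the real YARN protocol classes.
class RegistrationSketch {

    enum NodeState { RUNNING, UNHEALTHY }

    /** Registration request extended with a health status (assumed shape). */
    static final class RegistrationRequest {
        final String nodeId;
        final boolean healthy;
        final String healthReport;

        RegistrationRequest(String nodeId, boolean healthy, String healthReport) {
            this.nodeId = nodeId;
            this.healthy = healthy;
            this.healthReport = healthReport;
        }
    }

    /** Toy RM that records every node but only schedules on RUNNING ones. */
    static final class ToyResourceManager {
        private final Map<String, NodeState> nodes = new LinkedHashMap<>();

        /** Records the node and picks its initial state from the request. */
        NodeState register(RegistrationRequest req) {
            NodeState state = req.healthy ? NodeState.RUNNING : NodeState.UNHEALTHY;
            nodes.put(req.nodeId, state);
            return state;
        }

        /** Nodes the scheduler may assign containers to. */
        List<String> schedulableNodes() {
            List<String> result = new ArrayList<>();
            for (Map.Entry<String, NodeState> e : nodes.entrySet()) {
                if (e.getValue() == NodeState.RUNNING) {
                    result.add(e.getKey());
                }
            }
            return result;
        }

        int registeredNodeCount() {
            return nodes.size();
        }
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        rm.register(new RegistrationRequest("nm-1", true, ""));
        rm.register(new RegistrationRequest("nm-2", false, "no usable local dirs"));
        System.out.println("registered: " + rm.registeredNodeCount());
        System.out.println("schedulable: " + rm.schedulableNodes());
    }
}
```

In this sketch an unhealthy node is still recorded at registration, so it 
remains visible, but it never enters the set of nodes that containers can be 
assigned to.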



