[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203473#comment-17203473 ] Eric Payne commented on YARN-9809: -- I have committed this to branch-3.3 and branch-3.2. It looks like there is some additional work necessary if we want this to be backported to 3.1. I, for one, don't think that is necessary, but please comment if you disagree. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203346#comment-17203346 ] Eric Payne commented on YARN-9809: -- The latest branch-3.2 precommit build looks fine. The unit test failures are the same ones that are failing on branch-3.2 without the patch _except_ {{TestRaceWhenRelogin}}, which is not failing for me in my local build with or without the patch. +1. I will commit this today. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202448#comment-17202448 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m 33s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} buf {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} buf was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 18 new or modified test files. {color} | || || || || {color:brown} branch-3.2 Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 35s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 33m 37s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 23m 16s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 38s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 29s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 29m 38s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 41s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 38s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 14m 31s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 7s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 22m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 22m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 22m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 4m 16s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/195/artifact/out/diff-checkstyle-root.txt{color} | {color:orange} root: The patch generated 3 new + 1258 unchanged - 1 fixed = 1261 total (was 1259) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 5s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green}{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 24s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 59s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 42s{color} | {color:green}{color} |
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202390#comment-17202390 ] Eric Payne commented on YARN-9809: -- Version 009 LGTM. +1 > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202388#comment-17202388 ] Eric Payne commented on YARN-9809: -- Thanks a lot, [~ebadger] for the backport, and thank you [~Jim_Brennan] for the great reviews. I have verified that the following unit tests are also failing in branch-3.2: {noformat} TestYarnConfigurationFields TestZKConfigurationStore TestSystemMetricsPublisherForV2 TestFSSchedulerConfigurationStore TestCombinedSystemMetricsPublisher {noformat} > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202352#comment-17202352 ] Jim Brennan commented on YARN-9809: --- Thanks for fixing the test [~ebadger]! +1 for YARN-9809-branch-3.2.009.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202314#comment-17202314 ] Eric Badger commented on YARN-9809: --- So close. Those pesky unit tests. Patch 009 fixes the unit test failure. Thanks for the review, [~Jim_Brennan]! > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202289#comment-17202289 ] Jim Brennan commented on YARN-9809: --- Thanks [~ebadger] for the updated patch! I am +1 on YARN-9809-branch-3.2.008.patch except for this one failing unit test: TestYarnConfigurationFields {noformat} 2020-09-25 11:46:46,358 ERROR conf.TestConfigurationFieldsBase (TestConfigurationFieldsBase.java:testCompareConfigurationClassAgainstXml(485)) - class org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in yarn-default.xml 2020-09-25 11:46:46,359 INFO conf.TestConfigurationFieldsBase (TestConfigurationFieldsBase.java:lambda$appendMissingEntries$1(507)) - yarn.nodemanager.health-checker.run-before-startup {noformat} > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201900#comment-17201900 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 34s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} buf {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} buf was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 18 new or modified test files. {color} | || || || || {color:brown} branch-3.2 Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 26s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 27s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 26s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 54s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 15s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 27m 6s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 13s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 16s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 21s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 28s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 30s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 9s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 20m 9s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 20m 9s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 3m 50s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/194/artifact/out/diff-checkstyle-root.txt{color} | {color:orange} root: The patch generated 3 new + 1258 unchanged - 1 fixed = 1261 total (was 1259) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 22s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 48s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 57s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 20s{color} | {color:green}{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 2s{color}
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201814#comment-17201814 ] Eric Badger commented on YARN-9809: --- I've attached branch-3.2 patch 008 to address your comments, [~Jim_Brennan]. I think I got all of the unit tests to pass. But TestCombinedSystemMetricsPublisher, TestSystemMetricsPublisherForV2, TestFSSchedulerConfigurationStore, and TestZKConfigurationStore failed for me locally on straight up branch-3.2 > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201782#comment-17201782 ] Eric Badger commented on YARN-9809: --- {noformat} RMNodeImpl#AddNodeTransition#transition RMNodeStatusEvent rmNodeStatusEvent = new RMNodeStatusEvent(nodeId, nodeStatus); NodeHealthStatus nodeHealthStatus = updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent); if (nodeHealthStatus.getIsNodeHealthy()) { {noformat} bq. Do we run the risk of nodeHealthStatus being null? [~epayne], nope we should be fine here. {{nodeHealthStatus}} comes from the return value of {{updateRMNodeFromStatusEvents}}. The return value of that method comes from {{statusEvent.getNodeHealthStatus()}}. But {{statusEvent}} is passed into this method via an argument. On the caller side that argument is named {{rmNodeStatusEvent}} and it is craeted a few lines up via the RMNodeStatusEvent constructor. The {{nodeStatus}} is set there via the constructor and we know it won't be null because we are in the "else" of the "if" statement that checked for {{nodeStatus}} being null. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201136#comment-17201136 ] Jim Brennan commented on YARN-9809: --- I finished a first pass. Here are my comments: NodeHealthScriptRunner * Need to add code to Nodemanager to get the runBeforeStartup conf and pass it to constructor. * Need to make startup run optional based on runBeforeStartup RegisterNodeManagerRequest * See the trunk version of the patch. You should only have to add the new parameter to the last newInstance() interface, and have the second to last pass null. * This might reduce the number of tests you need to modify. RMNodeImpl * addNodeTransition - I think this line should this line be removed? {noformat} // Increment activeNodes explicitly because this is a new node. ClusterMetrics.getMetrics().incrNumActiveNodes(); {noformat} * updateMetricsForRejoinedNode - think we need to remove metrics.incrNumActiveNodes(); TestRMNodeTransitions * new testAddUnhealthyNode() test is not here These should not be needed if you fix constructors for RegisterNodeManagerRequest * TestProtocolRecords * TestRegisterNodeManagerRequest * TestResourceTrackerOnHA * TestYarnServerApiClasses > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201116#comment-17201116 ] Eric Badger commented on YARN-9809: --- Thanks for the initial reviews, [~epayne] and [~Jim_Brennan]! I will put up an updated patch soon with changes related to your comments. I also noticed some other issues that are manifesting as the unit test failures. So I will fix those as well. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201098#comment-17201098 ] Jim Brennan commented on YARN-9809: --- Thanks [~ebadger] for putting up a branch-3.2 patch! I am still reading, but wanted to make this initial comment: This patch does not include the config parameter {{NM_HEALTH_CHECK_RUN_BEFORE_STARTUP}} to make this an opt-in feature. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201061#comment-17201061 ] Eric Payne commented on YARN-9809: -- Thanks a lot [~ebadger] for putting upt the 3.2 backport patch. I'm still going through it, but I had one question after my first pass: {code:java|title=RMNodeImpl#AddNodeTransition#transition} RMNodeStatusEvent rmNodeStatusEvent = new RMNodeStatusEvent(nodeId, nodeStatus); NodeHealthStatus nodeHealthStatus = updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent); if (nodeHealthStatus.getIsNodeHealthy()) { {code} Do we run the risk of {{nodeHealthStatus}} being null? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199792#comment-17199792 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 22m 27s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} buf {color} | {color:blue} 0m 0s{color} | {color:blue} buf was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 21 new or modified test files. {color} | || || || || {color:brown} branch-3.2 Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 20s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 59s{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 19m 36s{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 14s{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 11s{color} | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 11s{color} | {color:green} branch-3.2 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 53s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 55s{color} | {color:green} branch-3.2 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 19s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 14m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 24s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 53s{color} | {color:orange} root: The patch generated 4 new + 1035 unchanged - 1 fixed = 1039 total (was 1036) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 37s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 14s{color} | {color:green} hadoop-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 37s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 13s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 18s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 8s{color} | {color:green} hadoop-yarn-client in the patch passed. {color}
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199707#comment-17199707 ] Eric Badger commented on YARN-9809: --- [~epayne], [~Jim_Brennan], sorry for the delay. I have put up a patch for branch-3.2. However I think this needs another round of review because the diff was quite massive on the cherry-pick and I had to redo a lot of stuff by hand. So in a lot of ways, this is a completely new patch. I think I got all of the unit tests that would've failed, but we'll see what HadoopQA says. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190724#comment-17190724 ] Jim Brennan commented on YARN-9809: --- No objection to backporting. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190414#comment-17190414 ] Eric Payne commented on YARN-9809: -- [~ebadger], this doesn't backport cleanly to 3.2. Would you mind taking a look when you have a chance? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190392#comment-17190392 ] Eric Payne commented on YARN-9809: -- Unless there are objections, I would like to backport this to 3.1. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148914#comment-17148914 ] Hudson commented on YARN-9809: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18394 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18394/]) YARN-9809. Added node manager health status to resource manager (eyang: rev e8dc862d3856e9eaea124c625dade36f1dd53fe2) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/health/TimedHealthReporterService.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/resourcetracker/TestNMReconnect.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestEventFlow.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerHealth.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/health/NodeHealthScriptRunner.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodes.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/scheduler/TestContainerSchedulerQueuing.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * (edit)
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148906#comment-17148906 ] Eric Badger commented on YARN-9809: --- Thanks, [~eyang] for the review and commit and [~Jim_Brennan] for the review! > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148829#comment-17148829 ] Eric Badger commented on YARN-9809: --- Thanks for the review, [~eyang]! Are you planning on committing this or would you like me to? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148066#comment-17148066 ] Eric Yang commented on YARN-9809: - +1 for patch 007. Tested both healthy and unhealthy health check scripts in my limited 1 node environment. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146522#comment-17146522 ] Eric Badger commented on YARN-9809: --- Thanks, [~Jim_Brennan]! [~eyang], would you take another look? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146402#comment-17146402 ] Jim Brennan commented on YARN-9809: --- Thanks for the update [~ebadger]! I am +1 (non-binding) on patch 007. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145981#comment-17145981 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m 36s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 27s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 3s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 54s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 36s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 57s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 7m 30s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 7m 30s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 52s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 1231 unchanged - 3 fixed = 1231 total (was 1234) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 12s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 40s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common generated 11 new + 89 unchanged - 11 fixed = 100 total (was 100) {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 56s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 8s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 16s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 54s{color} | {color:green} hadoop-yarn-server-common in the patch
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145904#comment-17145904 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 2m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 27s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 25m 3s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 43s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 21s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 10s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 25s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 10m 25s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 56s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 1232 unchanged - 3 fixed = 1232 total (was 1235) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 33s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 0s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 18s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 54s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 27s{color} |
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145893#comment-17145893 ] Eric Badger commented on YARN-9809: --- Good catch, [~Jim_Brennan]. {{updateMetricsForRejoinedNode()}} is only called in one other place and I don't want to add the node and then remove it again. So I removed the increment from {{updateMetricsForRejoinedNode()}} and explicitly added it to just before the other place where {{updateMetricsForRejoinedNode()}} is called. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145869#comment-17145869 ] Jim Brennan commented on YARN-9809: --- Thanks for the updates [~ebadger]! I have one comment on the new patch: RMNodeImpl * I think there's a bug from moving the call to \{{ClusterMetrics.getMetrics().incrNumActiveNodes()}}. If previousRMNode != null (in the first check), we call \{{rmNode.updateMetricsForRejoinedNode()}}, which decrements the counter for the previous state and increments num active nodes. With your change, we now increment active nodes again when we call reportNodeRunning. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145717#comment-17145717 ] Eric Badger commented on YARN-9809: --- The TestFairScheduler and TestFairSchedulerPreemption test failures are unrelated to this JIRA as they have also been reported in https://issues.apache.org/jira/browse/YARN-10329 > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145711#comment-17145711 ] Eric Badger commented on YARN-9809: --- Patch 006 moves {{ClusterMetrics.getMetrics().incrNumActiveNodes();}} into {{reportNodeRunning}} inside of the addNodeTransition. This fixes the failing unit test and prevents a scenario where we add an unhealthy node as RUNNING and then quickly switching it to UNHEALTHY. This way we go straight to UNHEALTHY. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144632#comment-17144632 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 2s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 25s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 49s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 16s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 8m 3s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 8m 3s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 41s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 1232 unchanged - 3 fixed = 1234 total (was 1235) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 43s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 50s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 59s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 1s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 40s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 35s{color} |
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144544#comment-17144544 ] Eric Badger commented on YARN-9809: --- Thanks for the review, [~Jim_Brennan]! I've uploaded patch 005 to fix the things you mentioned in your comments > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140841#comment-17140841 ] Jim Brennan commented on YARN-9809: --- Thanks for the patch [~ebadger]! Overall I think the design and code look good. Here are some comments: MockNM - line 196 - Isn't this loop that removes completedContainers a no-op? {noformat} ArrayList completedContainers = new ArrayList(); status.setContainersStatuses( new ArrayList(containerStats.values())); for (ContainerId cid : completedContainers) { containerStats.remove(cid); } {noformat} MockRM - This code is repeated in a lot of tests. Maybe we could add a function somewhere that does this so we can just pass getMockNodeStatus() instead? TestAbstractYarnScheduler, TestCapacityScheduler, testFairScheduler, TestFifoScheduler, TestNMExpiry, TestNMReconnect, TestResourceManager, TestRMAppLogAggregationStatus, TestRMNodeTransitions, TestRMWebServicesNodes, TestSchedulerHealth, {noformat} NodeStatus mockNodeStatus = mock(NodeStatus.class); NodeHealthStatus mockNodeHealthStatus = mock(NodeHealthStatus.class); when(mockNodeStatus.getNodeHealthStatus()).thenReturn(mockNodeHealthStatus); when(mockNodeHealthStatus.getIsNodeHealthy()).thenReturn(true); {noformat} RMAppManager - This looks like an accidental edit: {noformat} // Escape YarnServerCommonServiceProtossequences {noformat} RMNodeImpl - line 894 Don't we have to deal with the possibility that nodeStatus is null here? Seems like that is a possibilty. I think null nodeStatus should be treated as healthy. The RegisterNodeManagerRequest constructors that pass null is what made me think this is necessary? {noformat} NodeStatus nodeStatus = startEvent.getNodeStatus(); RMNodeStatusEvent rmNodeStatusEvent = new RMNodeStatusEvent(nodeId, nodeStatus); NodeHealthStatus nodeHealthStatus = updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent); NodeState nodeState = null; if (nodeHealthStatus.getIsNodeHealthy()) { {noformat} - In the case where the node is unhealthy, can we just call reportNodeUnusable() instead of {noformat} rmNode.context.getDispatcher().getEventHandler().handle( new NodesListManagerEvent( NodesListManagerEventType.NODE_UNUSABLE, rmNode)); //Update the metrics rmNode.updateMetricsForDeactivatedNode(NodeState.RUNNING, NodeState.UNHEALTHY); {noformat} TestRMNodeTransitions - Maybe add a testAddUnhealthy here? TimedHealthReporterService - Do we need to be concerned about someone who might have their own implementation of TimedHealthReporterService? Should we maintain a constructor that takes two args and passes null for runBeforeStartup? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139516#comment-17139516 ] Jim Brennan commented on YARN-9809: --- I have started reviewing the patch, but I will need more time. I hope to finish today or tomorrow. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139511#comment-17139511 ] Eric Yang commented on YARN-9809: - [~ebadger] [~Jim_Brennan] I agree that health check script handling is separated from register health check status. +1 on patch 004. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139451#comment-17139451 ] Jim Brennan commented on YARN-9809: --- [~eyang], [~ebadger] changing the behavior of health-check scripts seems pretty dangerous. We looked into this issue a few years ago, because we had some cases where the health-check scripts were not installed properly, and some bad nodes were erroneously reporting healthy status. Rather than try to change the contract for how health-check scripts behave, which has been around for a very long time, we instead added a wrapper script that we ship with hadoop. The wrapper checks that the real health-check script exists and is executable, and if it's not, it prints an "ERROR" message so the NM will mark the node unhealthy. If the health-check script is good, we just exec it. I agree that changing the handling of health check script output/return value is beyond the scope of this Jira. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138899#comment-17138899 ] Eric Badger commented on YARN-9809: --- I can see pros and cons to both approaches. On the one hand, if the health check script fails to execute properly, that's not good and could imply something bad. But health check scripts are pretty dangerous since they can take out an entire cluster if they're written improperly. So if someone updates the script and all of a sudden the script errors out, the whole cluster is unhealthy. Or the health check script could rely on querying a service and that service times out. The node is healthy, but the health check script returned error. Unless you are parsing for specific error codes, you can no longer differentiate between the health check script failing internally and the health check script returning successfully that the node is unhealthy. Regardless of this discussion though, this is outside of the scope of this JIRA. That's an issue with how the health check script is handled while this JIRA is just about providing a health status at NM startup > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138894#comment-17138894 ] Eric Yang commented on YARN-9809: - [~ebadger] Sorry, my statement was not clear. If the script name is incorrect, resulting exit code is non-zero, or the execution exit code is non-zero. In those cases, health check will report as healthy. I think those conditions must be considered as unhealthy, in the event that check script does not have proper prerequisites. The errors can be caught. Is this something that we can fix to make this more user friendly? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138892#comment-17138892 ] Eric Badger commented on YARN-9809: --- {noformat:title=NodeHealthScriptRunner.newInstance()} if (!shouldRun(scriptName, nodeHealthScript)) { return null; } {noformat} {noformat:title=NodeHealthScriptRunner.shouldRun()} static boolean shouldRun(String script, String healthScript) { if (healthScript == null || healthScript.trim().isEmpty()) { LOG.info("Missing location for the node health check script \"{}\".", script); return false; } {noformat} If the health check script doesn't exist, then the health {{shouldRun}} will return false and the {{newInstance}} will return null. This will cause the health reporter to not be added as a service. So at the end of the day, your statement is correct. If the health check script doesn't exist, the node will report as healthy. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138853#comment-17138853 ] Eric Yang commented on YARN-9809: - [~Jim_Brennan] Thank you for the instruction. I updated my check script accordingly to: {code} #!/bin/bash echo "ERROR test" {code} This works. The script must return 0 exit code to work as well, otherwise, it will report as healthy. This implies, if the health check script doesn't exist, it reports as healthy. Is this right? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138833#comment-17138833 ] Jim Brennan commented on YARN-9809: --- [~eyang] I believe the health check script output must contain a line that begins with the string "ERROR" for the node to be marked as unhealthy. The exit code does not have any effect. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138685#comment-17138685 ] Eric Yang commented on YARN-9809: - [~ebadger] Thank you for the patch. The patch looks very close to final product. I have confirmed the test case failure doesn't happen, if there are sufficient amount of RAM on the testing node. I also validated that new node manager can work with unpatched resource manager. However, I could not get health check script to fail to cause node registered as unhealthy. Here is my check script: {code} #!/bin/bash echo "i am here" > /tmp/hello exit 1 {code} It would be nice to have verbose message to show the exit code of the health check script in node manager log file. The script is executed, but it shows healthy. What am I doing wrong? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137934#comment-17137934 ] Eric Badger commented on YARN-9809: --- {noformat} hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesSchedulerActivities hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer {noformat} Neither of these tests fail for me locally and are unrelated to the changes made in patch 004. Both the javac and the javadoc errors are coming from generated protobuf java files. I don't know how to get rid of these errors, but they aren't introducing any warnings that don't already exist. I think they're fine. The generation of the java files is the issue here. [~Jim_Brennan], [~ccondit], [~eyang], could you guys review patch 004? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136194#comment-17136194 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 5s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 16s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 43s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 45s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 10s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 8s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 28s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 9m 47s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 9m 47s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 58s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 1255 unchanged - 3 fixed = 1255 total (was 1258) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 39s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common generated 1 new + 99 unchanged - 1 fixed = 100 total (was 100) {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 33s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 13s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 34s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 8s{color} | {color:green} hadoop-yarn-server-common in the patch
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136095#comment-17136095 ] Eric Badger commented on YARN-9809: --- Patch 004 fixes checkstyle. There is still the javac error with PARSER being deprecated, but I don't know how to get rid of that. It is coming from a generated proto file. So I'm not quite sure what to do about that. The PARSER is used in many other places within the same generated file > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134646#comment-17134646 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 24m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 17 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 21s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 52s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 31s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 53s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 10s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 8m 23s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 8m 23s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 39s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 4 new + 1153 unchanged - 2 fixed = 1157 total (was 1155) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 4s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 46s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 58s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 3s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 40s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 22m 4s{color} | {color:red}
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128817#comment-17128817 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 31m 30s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 6 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 16s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 24m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 18s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 5s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 42s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 53s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 10m 53s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 52s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 484 unchanged - 1 fixed = 484 total (was 485) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 44s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 11s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 25s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 59s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 14m 34s{color} | {color:red}
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128686#comment-17128686 ] Eric Badger commented on YARN-9809: --- Attaching patch 002 to address unit test failures > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127215#comment-17127215 ] Hadoop QA commented on YARN-9809: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 29s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 21s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 49s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 20s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 8m 4s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 8m 4s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 1 new + 334 unchanged - 0 fixed = 335 total (was 334) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 34s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 5 new + 474 unchanged - 1 fixed = 479 total (was 475) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 40s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 46s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 59s{color} | {color:red} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 39s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 13m 11s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}345m 10s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color}
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127025#comment-17127025 ] Eric Badger commented on YARN-9809: --- Patch 001 adds the feature but makes it opt-in via the config {{yarn.nodemanager.health-checker.run-before-startup}}. I didn't put in the retries flag for shutting down the NM if there are a certain number of failures. I can do that in a subsequent patch if you'd like. But I tested this patch out and it seems to work. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110361#comment-17110361 ] Jim Brennan commented on YARN-9809: --- [~ccondit] I agree that a config to allow one to opt-in to this feature is a good idea. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110356#comment-17110356 ] Craig Condit commented on YARN-9809: Since health check scripts are by nature different for every deployment, it seems that neither the current behavior nor what is proposed makes sense in all cases. However, I believe there can be some middle ground. I propose we keep the existing behavior as default to avoid causing pain for existing users, but allow a configuration to opt-in to a single synchronous execution of a health check on startup before node check-in (controlled via a new *{{yarn.nodemanager.health-checker.preflight.enabled}}* boolean configuration). It may also be desirable to kill the NM upon repeated failures of this preflight check. We could add a new config *{{yarn.nodemanager.health-checker.preflight.retries}}* to control the number of retries before aborting the NM (or -1 for infinite). > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108617#comment-17108617 ] Eric Yang commented on YARN-9809: - [~Jim_Brennan] This feature is a great addition to make admin task easier for large scale cluster. What is the latency that we are talking about in health-check script? If it is a few seconds and less, I agree that there is marginal difference in startup time, and potential benefit is great. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108576#comment-17108576 ] Jim Brennan commented on YARN-9809: --- I would like to revive this discussion. We have this implemented internally. Our health-check script runs very quickly, so the impact on the time it takes to register with the RM is minimal (not really noticeable in our case). Our health-check script does a number of checks to validate the health of the host the NM is running on. We don't have any checks directly related to success/failure of containers launching since the last check, but even if we did, that particular check just wouldn't find anything if no containers have been launched yet. The cases that we were trying to address with this change involve hardware or os issues with the node that may not prevent a container from launching, but are serious enough to mark the node as unhealthy (memory/disk errors, etc...). We have seen this during a rolling upgrade. Nodes that had been previously marked as unhealthy would be brought up as part of the RU, and those nodes would start running containers only to be marked unhealthy 10 minutes later when the health-check script ran. This caused a lot of killed task attempts. With large clusters there can be hundreds of nodes that are unhealthy, so there can be a lot of failed task attempts. It seems the main question that [~eyang] is raising is whether we should allow a synchronous call to run the health-check script during nodemanager startup/registration. I agree that this can introduce a potential slowdown if the health-check-script is slow. In our case, the delay is not noticeable, and we think it is worth it to prevent the false start. What do others think? cc: [~ebadger], [~eyang], [~ccondit-target], [~shaneku...@gmail.com] > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922542#comment-16922542 ] Eric Badger commented on YARN-9809: --- bq. Although it is good to have a way to prevent scheduling containers to a node manager that is going through registration process to save network round trips and compute resources, the existing async design allows the node to show up in Resource Manager as quickly as possible to improve system admin user experience. But if that node is bad, then registering to the RM is just adding unnecessary work. The NM health check script can check for many things that are known without a container being run. For example, docker could not be installed, or nscd not running (causing a user lookup for every new container). These could be reasons for the node to declare itself as unhealthy depending on the specific health check script. If we register with the RM and then declare the node unhealthy afterwards then we have to kill every container that was scheduled in the period between registration and first heartbeat. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921761#comment-16921761 ] Eric Yang commented on YARN-9809: - [~ebadger] LocalDirsHandlerService checkDir is a timer task. It is low probability for the schedule task to complete before registration call happens. Although it is good to have a way to prevent scheduling containers to a node manager that is going through registration process to save network round trips and compute resources, the existing async design allows the node to show up in Resource Manager as quickly as possible to improve system admin user experience. I think there is merits in both approaches, but they seem mutually exclusive to each other. Please shed lights on your plan to keep existing responsiveness of registration and prevent containers leaking to bad node. Thanks > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921754#comment-16921754 ] Eric Badger commented on YARN-9809: --- bq. It is unlikely to determine unhealthy status until at least one container tried to run on the given node manager. This scenario can happen when none of the local dirs are available due to bad disks or for any other arbitrary reason in the health check script. For example, we have an optional offline file that can be set on the node to mark it as unhealthy. bq. How does health status field in registration heartbeat help? If the node can register as unhealthy then it won't ever have containers assigned to it. There is currently a period of time between registration and the first node heartbeat where the node appears to be healthy. bq. If containers are getting killed, they are supposed to schedule else where. Do you observe any problem in rescheduling containers? Yes, the containers will get rescheduled, but it is still wasteful to schedule containers to a node if we are just going to kill them shortly after. If this happens over many nodes at once then there are a lot of unnecessary container kills happening which we can avoid by sending the health status of the node with the initial RM registration. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921751#comment-16921751 ] Eric Yang commented on YARN-9809: - [~ebadger] It is unlikely to determine unhealthy status until at least one container tried to run on the given node manager. How does health status field in registration heartbeat help? If containers are getting killed, they are supposed to schedule else where. Do you observe any problem in rescheduling containers? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921741#comment-16921741 ] Eric Badger commented on YARN-9809: --- I propose adding a health status field to the NM-RM registration request. That way the NM can still register to the RM, but without getting a bunch of containers that will immediately get killed after the first heartbeat. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org