[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030163#comment-17030163 ] Peter Bacsko commented on YARN-9011:
-------------------------------------

Thanks [~mkonst] for sharing this. Yes, I was thinking about what happens if refresh is called quickly multiple times. The reason I ignored it is that this is initiated from the command line, and by the time you do it again, chances are that the node has already been in the DECOMMISSIONING state for a (relatively) long time. So it's not a realistic use case, but more like sabotage. Correct me if this reasoning is wrong.

Having said that, if you think it's something worth improving, I suggest that you create a follow-up JIRA and link it to this one. If you already have a patch or some code that you can share, I'm happy to review it.

> Race condition during decommissioning
> --------------------------------------
>
>                 Key: YARN-9011
>                 URL: https://issues.apache.org/jira/browse/YARN-9011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.1
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-9011-001.patch, YARN-9011-002.patch, YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, YARN-9011-006.patch, YARN-9011-007.patch, YARN-9011-008.patch, YARN-9011-009.patch, YARN-9011-branch-3.1.001.patch, YARN-9011-branch-3.2.001.patch
>
> During internal testing, we found a nasty race condition which occurs during decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
> 2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
> 2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=172.26.22.115 OPERATION=refreshNodes TARGET=AdminService RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability:
> 2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from {{ResourceTrackerService}}.
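To make the window described in the report concrete, here is a small, self-contained model of the interleaving. It is a sketch only: the class, field, and method names are made up and the real ResourceTrackerService/NodesListManager logic is more involved. It shows the same pattern, though: the exclude list becomes visible immediately, while the RUNNING-to-DECOMMISSIONING transition goes through an asynchronous dispatcher, so a heartbeat that lands in between takes the "Disallowed NodeManager" shutdown path.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Self-contained model of the race with made-up names; NOT the real YARN classes.
public class DecommissionRaceDemo {
    enum State { RUNNING, DECOMMISSIONING }

    static final Set<String> excludedHosts = new CopyOnWriteArraySet<>();
    static final ConcurrentHashMap<String, State> nodeState = new ConcurrentHashMap<>();
    // Stands in for the RM's asynchronous event dispatcher.
    static final ExecutorService asyncDispatcher = Executors.newSingleThreadExecutor();

    // Models "refreshNodes": the exclude list is published immediately, but the
    // RUNNING -> DECOMMISSIONING transition is only queued for later processing.
    static void refreshNodes(String host) {
        excludedHosts.add(host);
        asyncDispatcher.submit(() -> {
            Thread.sleep(10);                          // models dispatcher queueing delay
            nodeState.put(host, State.DECOMMISSIONING);
            return null;
        });
    }

    // Models the heartbeat check: an excluded node that is not (yet) DECOMMISSIONING
    // gets the "Disallowed NodeManager" SHUTDOWN response seen in the logs above.
    static String heartbeat(String host) {
        if (excludedHosts.contains(host) && nodeState.get(host) != State.DECOMMISSIONING) {
            return "SHUTDOWN: Disallowed NodeManager " + host;
        }
        return "OK: " + nodeState.get(host);
    }

    public static void main(String[] args) throws Exception {
        nodeState.put("node-6.hostname.com", State.RUNNING);
        refreshNodes("node-6.hostname.com");
        // A heartbeat arriving inside the window almost always hits the SHUTDOWN path.
        System.out.println(heartbeat("node-6.hostname.com"));
        asyncDispatcher.shutdown();
    }
}
{code}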
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029396#comment-17029396 ] Mikayla Konst commented on YARN-9011:
--------------------------------------

We experienced this exact same race condition recently (the resource manager sent a SHUTDOWN signal to a node manager because it received a heartbeat from the node manager *after* the HostDetails reference was updated, but *before* the node was transitioned to state DECOMMISSIONING).

I think this patch is a huge improvement over the previous behavior, but there is still a narrow race that can happen when refresh nodes is called multiple times in quick succession with the same set of nodes in the exclude file:
# lazy-loaded HostDetails reference is updated
# nodes are added to the gracefullyDecommissionableNodes set
# current HostDetails reference is updated
# the event to update the node status to DECOMMISSIONING is added to the asynchronous event handler's queue, but hasn't been processed yet
# refresh nodes is called a second time
# lazy-loaded HostDetails reference is updated
# gracefullyDecommissionableNodes set is cleared
# the node manager heartbeats to the resource manager; it is not in state DECOMMISSIONING and not in the gracefullyDecommissionableNodes set, but it is an excluded node in the HostDetails, so it is sent a SHUTDOWN signal
# the node is added to the gracefullyDecommissionableNodes set
# the event handler transitions the node to state DECOMMISSIONING at some point

This would be fixed if you used an AtomicReference for your set of gracefullyDecommissionableNodes and swapped out the reference, similar to how you handled the HostDetails (see the sketch below).

Alternatively, instead of using an asynchronous event handler to update the state of the nodes to DECOMMISSIONING, you could update the state synchronously: grab a lock, update HostDetails and synchronously update the states of the nodes being gracefully decommissioned, then release the lock. When the resource tracker service receives a heartbeat and needs to check whether a node should be shut down (whether it is excluded and in state DECOMMISSIONING), it would grab the lock right before doing the check. Having the resource tracker service wait on a lock doesn't sound great, but it would likely be on the order of milliseconds, and only when refresh nodes is called.
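A minimal sketch of the reference-swap idea from the comment above, using hypothetical names (this is not the actual NodesListManager code): the refresh path builds a complete new set and publishes it with a single atomic store, so a concurrent heartbeat never observes a half-cleared set.
{code:java}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of an AtomicReference swap for the decommissionable-node set (hypothetical
 * class and method names). Readers always see either the old complete set or the new
 * complete set, never an intermediate, partially cleared one.
 */
class GracefulDecommissionTracker {
    private final AtomicReference<Set<String>> gracefullyDecommissionableNodes =
        new AtomicReference<>(Collections.emptySet());

    /** Called from the refresh path with the freshly computed set of excluded hosts. */
    void publish(Set<String> newlyDecommissionable) {
        gracefullyDecommissionableNodes.set(
            Collections.unmodifiableSet(newlyDecommissionable));
    }

    /** Called from the heartbeat path; lock-free read of a consistent snapshot. */
    boolean isGracefullyDecommissionable(String host) {
        return gracefullyDecommissionableNodes.get().contains(host);
    }
}
{code}
Compared with the lock-based alternative described in the same comment, this keeps the heartbeat path lock-free, at the cost of rebuilding and republishing the whole set on every refresh.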
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982470#comment-16982470 ] Szilard Nemeth commented on YARN-9011:
---------------------------------------

Hi [~pbacsko]! Patches look good, committed them to their respective branches. Thanks for your contribution!
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977527#comment-16977527 ] Peter Bacsko commented on YARN-9011:
-------------------------------------

[~snemeth] I think we're good to go with branch-3.2 and branch-3.1.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977524#comment-16977524 ] Hadoop QA commented on YARN-9011:
----------------------------------

| (/) +1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 22m 59s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-3.1 Compile Tests ||
| 0 | mvndep | 2m 19s | Maven dependency ordering for branch |
| +1 | mvninstall | 21m 53s | branch-3.1 passed |
| +1 | compile | 13m 13s | branch-3.1 passed |
| +1 | checkstyle | 2m 20s | branch-3.1 passed |
| +1 | mvnsite | 2m 5s | branch-3.1 passed |
| +1 | shadedclient | 17m 44s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 37s | branch-3.1 passed |
| +1 | javadoc | 1m 53s | branch-3.1 passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 56s | the patch passed |
| +1 | compile | 16m 34s | the patch passed |
| +1 | javac | 16m 34s | the patch passed |
| +1 | checkstyle | 2m 31s | the patch passed |
| +1 | mvnsite | 2m 1s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 11s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 12s | the patch passed |
| +1 | javadoc | 1m 30s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 8m 43s | hadoop-common in the patch passed. |
| +1 | unit | 71m 7s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 37s | The patch does not generate ASF License warnings. |
| | | 207m 30s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:080e9d0f9b3 |
| JIRA Issue | YARN-9011 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986224/YARN-9011-branch-3.1.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 1950314b69fc 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.1 / 96e07b7 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25197/testReport/ |
| Max.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976799#comment-16976799 ] Hadoop QA commented on YARN-9011:
----------------------------------

| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 8m 22s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-3.2 Compile Tests ||
| 0 | mvndep | 2m 16s | Maven dependency ordering for branch |
| +1 | mvninstall | 20m 26s | branch-3.2 passed |
| +1 | compile | 14m 35s | branch-3.2 passed |
| +1 | checkstyle | 2m 15s | branch-3.2 passed |
| +1 | mvnsite | 2m 9s | branch-3.2 passed |
| +1 | shadedclient | 17m 1s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 0s | branch-3.2 passed |
| +1 | javadoc | 1m 49s | branch-3.2 passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 22s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 30s | the patch passed |
| +1 | compile | 13m 31s | the patch passed |
| +1 | javac | 13m 31s | the patch passed |
| +1 | checkstyle | 2m 20s | the patch passed |
| +1 | mvnsite | 2m 1s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 40s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 11s | the patch passed |
| +1 | javadoc | 1m 46s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 8m 41s | hadoop-common in the patch passed. |
| -1 | unit | 70m 20s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 47s | The patch does not generate ASF License warnings. |
| | | 187m 0s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
| | hadoop.yarn.server.resourcemanager.metrics.TestCombinedSystemMetricsPublisher |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:63396beab41 |
| JIRA Issue | YARN-9011 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986126/YARN-9011-branch-3.2.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux c632febb0e47 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.2 /
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976782#comment-16976782 ] Hadoop QA commented on YARN-9011:
----------------------------------

| (/) +1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 39s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 13s | Maven dependency ordering for branch |
| +1 | mvninstall | 20m 20s | trunk passed |
| +1 | compile | 16m 59s | trunk passed |
| +1 | checkstyle | 2m 40s | trunk passed |
| +1 | mvnsite | 2m 12s | trunk passed |
| +1 | shadedclient | 18m 25s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 18s | trunk passed |
| +1 | javadoc | 1m 57s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 28s | the patch passed |
| +1 | compile | 17m 49s | the patch passed |
| +1 | javac | 17m 49s | the patch passed |
| +1 | checkstyle | 2m 55s | the patch passed |
| +1 | mvnsite | 2m 25s | the patch passed |
| +1 | whitespace | 0m 1s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 6s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 36s | the patch passed |
| +1 | javadoc | 2m 2s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 12s | hadoop-common in the patch passed. |
| +1 | unit | 84m 57s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 43s | The patch does not generate ASF License warnings. |
| | | 205m 44s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9011 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985199/YARN-9011-009.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux bebc4d6a66bf 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f81172 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25189/testReport/ |
| Max. process+thread count | 1348 (vs. ulimit of 5500) |
| modules
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976645#comment-16976645 ] Hudson commented on YARN-9011:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17657 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17657/])
YARN-9011. Race condition during decommissioning. Contributed by Peter (snemeth: rev 27642367ef3409a9ca93747c6c2cc279c087a4c0)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/HostsFileReader.java
* (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestHostsFileReader.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976635#comment-16976635 ] Szilard Nemeth commented on YARN-9011:
---------------------------------------

Thanks again [~pbacsko]! Committed to trunk! Thanks [~adam.antal] for the review! Could you please take care of the backport patch to at least branch-3.2? Thanks!
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971492#comment-16971492 ] Szilard Nemeth commented on YARN-9011:
---------------------------------------

Hi [~pbacsko]! Thanks for this patch, +1. Would commit this soon if no objections. [~bibinchundatt], [~tangzhankun]: Any more comments? Thanks!
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969481#comment-16969481 ] Adam Antal commented on YARN-9011:
-----------------------------------

Great! +1 (non-binding), hope to see this committed soon.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969332#comment-16969332 ] Hadoop QA commented on YARN-9011:
----------------------------------

| (/) +1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 51s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 14s | Maven dependency ordering for branch |
| +1 | mvninstall | 20m 55s | trunk passed |
| +1 | compile | 17m 4s | trunk passed |
| +1 | checkstyle | 2m 34s | trunk passed |
| +1 | mvnsite | 2m 8s | trunk passed |
| +1 | shadedclient | 18m 21s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 1s | trunk passed |
| +1 | javadoc | 1m 58s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 21s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 30s | the patch passed |
| +1 | compile | 15m 43s | the patch passed |
| +1 | javac | 15m 43s | the patch passed |
| +1 | checkstyle | 2m 36s | the patch passed |
| +1 | mvnsite | 2m 3s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 54s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 16s | the patch passed |
| +1 | javadoc | 2m 2s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 5s | hadoop-common in the patch passed. |
| +1 | unit | 85m 4s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 42s | The patch does not generate ASF License warnings. |
| | | 201m 52s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9011 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985199/YARN-9011-009.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux fc77a6870914 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / dd90025 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25113/testReport/ |
| Max. process+thread count | 1348 (vs. ulimit of 5500) |
| modules
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969193#comment-16969193 ] Peter Bacsko commented on YARN-9011:
-------------------------------------

[~adam.antal] thanks for the comments.

_"[...] renaming HostsFileReader$refresh(String,String,boolean) to refreshInternal. Could you please do that to make that class more clear?"_
Done.

_"but I am assured that the internal structure will not get damaged by this."_
There's no danger there. Doing multiple lazy refreshes isn't an issue (see the sketch below).

_"why did you move the following line inside ResourceTrackerService$nodeHeartbeat."_
Ah, that's a remnant from previous patches. Moved back. Also did the null-check improvement.
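The lazy refresh mentioned here can be pictured roughly as a two-phase swap of the host lists. The sketch below uses simplified, hypothetical types rather than the real HostsFileReader API, but it shows why repeating the first phase is harmless: only the pending snapshot is overwritten, and the published snapshot changes in one atomic step, so a failure between the two phases also leaves readers unaffected.
{code:java}
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Rough sketch of a two-phase host-list refresh (hypothetical class, not the actual
 * HostsFileReader). lazyRefresh() only builds a pending snapshot; finishRefresh()
 * publishes it with a single atomic swap.
 */
class TwoPhaseHostsReader {
    static final class HostDetails {
        final Set<String> includes;
        final Set<String> excludes;
        HostDetails(Set<String> includes, Set<String> excludes) {
            this.includes = includes;
            this.excludes = excludes;
        }
    }

    private final AtomicReference<HostDetails> current = new AtomicReference<>();
    private final AtomicReference<HostDetails> pending = new AtomicReference<>();

    /** Phase 1: compute the new lists, but do not expose them to readers yet. */
    void lazyRefresh(Set<String> includes, Set<String> excludes) {
        pending.set(new HostDetails(includes, excludes));
    }

    /** Phase 2: publish the pending snapshot; readers see old or new, never a mix. */
    void finishRefresh() {
        HostDetails prepared = pending.getAndSet(null);
        if (prepared != null) {
            current.set(prepared);
        }
    }

    /** Heartbeat-side read: always a complete, consistent snapshot. */
    HostDetails getHostDetails() {
        return current.get();
    }
}
{code}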
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968409#comment-16968409 ] Adam Antal commented on YARN-9011: Thanks for the patch, [~pbacsko]. I am not entirely sure that I understood this patch perfectly, but the discussion above made me confident that this is the right approach. I think it would be a good idea to stick to the generic naming convention by renaming {{HostsFileReader$refresh(String,String,boolean)}} to {{refreshInternal}}. Could you please do that to make that class clearer? I was a bit concerned about what happens if an unchecked exception occurs inside {{NodesListManager$handleExcludeNodeList}} (between calling {{HostsFileReader$lazyRefresh}} and {{HostsFileReader$finishRefresh}}), but I am assured that the internal structure will not get damaged by this. Also, a side question: why did you move the following line inside {{ResourceTrackerService$nodeHeartbeat}}? Since the same object is received and it cannot be added or removed between the two calls, I would not touch it.
{code:java}
RMNode rmNode = this.rmContext.getRMNodes().get(nodeId);
{code}
I liked the extra null check you perform on {{rmNode.getState()}} in {{ResourceTrackerService$updateAppCollectorsMap}}. I'd take it one step further: if you started that function with an {{rmNode == null}} check (returning false in that case), we could save the extra null checks in that function, and it would also make the null check in {{NodesListManager$isGracefullyDecommissionableNode}} unnecessary. I'd give a +1 (non-binding) pending these changes.
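For illustration, here is a minimal, self-contained sketch of the guard-clause shape suggested above. The class and method names are simplified stand-ins, not the actual Hadoop signatures:
{code:java}
// Hedged illustration only: simplified stand-ins, not the real Hadoop classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeStateChecker {
  enum State { RUNNING, DECOMMISSIONING }

  private final Map<String, State> rmNodes = new ConcurrentHashMap<>();

  boolean isNodeInDecommissioning(String nodeId) {
    State state = rmNodes.get(nodeId);
    // Early return on the null case saves repeated null checks further down
    // and in helper methods that receive the looked-up node.
    if (state == null) {
      return false;
    }
    return state == State.DECOMMISSIONING;
  }
}
{code}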
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1696#comment-1696 ] Hadoop QA commented on YARN-9011: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 29m 1s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 7s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 27s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 24s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 21s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 11s{color} | {color:green} hadoop-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 81m 37s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 51s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}220m 19s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984267/YARN-9011-008.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 15df0be53ab2 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ed45c13 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25060/testReport/ | | Max. process+thread count | 1480 (vs. ulimit of 5500) | |
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961988#comment-16961988 ] Peter Bacsko commented on YARN-9011: Thanks [~tangzhankun] for the comments - glad we're on the same page. I just uploaded patch v8 with the suggested changes. Note that there's one more possible approach: when we perform {{refresh()}}, we pass in whether we want a graceful exclusion or not, and store that extra information in {{HostDetails}} in a boolean field (e.g. {{gracefulExclusion = true}}). This still requires a check inside {{isNodeInDecommissioning()}}, but it results in a smaller change and we don't need to store {{RMNode}} instances in a set. I haven't thought this over deeply, though; it's just something to consider.
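For illustration only, a minimal sketch of that alternative. {{HostDetailsSnapshot}}, {{gracefulExclusion}} and {{isGracefullyExcluded}} are hypothetical names, not part of the existing {{HostsFileReader}} API:
{code:java}
// Hypothetical sketch of the alternative idea above: record at refresh time
// whether the exclusion was requested as graceful, instead of keeping a
// separate set of RMNode instances. All names are illustrative only.
import java.util.Set;

class HostDetailsSnapshot {
  private final Set<String> excludedHosts;
  private final boolean gracefulExclusion;   // set when refresh(graceful=true)

  HostDetailsSnapshot(Set<String> excludedHosts, boolean gracefulExclusion) {
    this.excludedHosts = excludedHosts;
    this.gracefulExclusion = gracefulExclusion;
  }

  // The heartbeat path could then treat an excluded host as "decommissioning"
  // while the graceful flag is set, even before the RMNode state transition.
  boolean isGracefullyExcluded(String host) {
    return gracefulExclusion && excludedHosts.contains(host);
  }
}
{code}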
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961935#comment-16961935 ] Zhankun Tang commented on YARN-9011: [~pbacsko], thanks for the explanation. After the offline sync-up, this seems like a good lock-free way to go. +1 from me.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961897#comment-16961897 ] Peter Bacsko commented on YARN-9011: OK, actually we always call the {{isGracefullyDecommissionableNode()}} method inside {{isNodeInDecommissioning()}}. We just have to slightly re-arrange the order of the calls, like:
{code:java}
private boolean isNodeInDecommissioning(NodeId nodeId) {
  RMNode rmNode = this.rmContext.getRMNodes().get(nodeId);

  // state OK - early return
  if (rmNode != null && rmNode.getState() == NodeState.DECOMMISSIONING) {
    return true;
  }

  // Graceful decom: wait until node moves out of RUNNING state.
  if (rmNode != null
      && this.nodesListManager.isGracefullyDecommissionableNode(rmNode)) {
    NodeState currentState = rmNode.getState();
    if (currentState == NodeState.RUNNING) {
      return true;
    }
  }

  return false;
}
{code}
This avoids the unnecessary invocation of {{nodesListManager.isGracefullyDecommissionableNode()}}.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961879#comment-16961879 ] Peter Bacsko commented on YARN-9011: _"1. Why do we need a lazy update?"_ Please see the details in my comment above that I posted on 25th Sep: https://issues.apache.org/jira/browse/YARN-9011?focusedCommentId=16937696=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16937696 It is important that when you do a "lazy" refresh, the new changes are not yet made visible to {{ResourceTrackerService}}. The problematic part of the code is this:
{noformat}
// 1. Check if it's a valid (i.e. not excluded) node, if not, see if it is
// in decommissioning.
if (!this.nodesListManager.isValidNode(nodeId.getHost())
    && !isNodeInDecommissioning(nodeId)) {
...
{noformat}
If you perform a graceful decom, it is important that {{isNodeInDecommissioning()}} return true. However, it takes time for {{RMNodeImpl}} to go into the {{DECOMMISSIONING}} state, which is why this code is not fully reliable. Therefore, {{isValidNode()}} should only return false when we have already constructed the set of nodes that we want to decommission. _2. Could we check the "Decommissioning" status before "isGracefullyDecommissionableNode" in method "isNodeInDecommissioning"?_ No, we can't (well, we can, but it would be pointless). The DECOMMISSIONING status only appears after you refresh (reload) the exclusion/inclusion files. That is, we need to call {{NodesListManager.refreshNodes()}}. And that is the problem - during refresh, excludable nodes become visible almost immediately, but not the fact that they're decommissionable. _3. So it will always be scanned when heartbeat which seems not necessary._ Scanning is necessary to avoid the race condition, but this isn't really a problem, for two reasons: 1. It happens only for those nodes which are excluded ({{isValid()}} is false). 2. We look up entries in a ConcurrentHashMap, which should be really fast. I can imagine an enhancement here: once the node has reached the {{DECOMMISSIONING}} state, we remove it from the set, making it smaller and smaller.
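To make the heartbeat-side reasoning above concrete, here is a simplified, self-contained model of the decision. This is a sketch only; the real checks live in {{ResourceTrackerService}} and {{NodesListManager}}, and the shrinking-set method is just the enhancement floated above:
{code:java}
// Simplified model of the heartbeat-side decision; illustrative names only.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatGate {
  enum Decision { CONTINUE, SHUTDOWN }

  private final Set<String> validHosts = ConcurrentHashMap.newKeySet();
  // Hosts excluded by a graceful refresh; consulted until the node reaches
  // the DECOMMISSIONING state.
  private final Set<String> gracefullyDecommissionable =
      ConcurrentHashMap.newKeySet();

  Decision onHeartbeat(String host, boolean nodeInDecommissioning) {
    // A node is only told to shut down if it is excluded AND we cannot show
    // that it is (or is about to be) decommissioning.
    if (!validHosts.contains(host)
        && !nodeInDecommissioning
        && !gracefullyDecommissionable.contains(host)) {
      return Decision.SHUTDOWN;
    }
    return Decision.CONTINUE;
  }

  // Possible enhancement mentioned above: once a node has reached the
  // DECOMMISSIONING state, drop it from the graceful set to keep it small.
  void onNodeDecommissioning(String host) {
    gracefullyDecommissionable.remove(host);
  }
}
{code}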
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961621#comment-16961621 ] Zhankun Tang commented on YARN-9011: [~pbacsko], thanks for the new patch. The idea looks good to me. Several comments: 1. Why do we need a lazy update? I don't see any difference in "hostDetails" between "getLazyLoadedHostDetails" and "getHostDetails". 2. Could we check the "Decommissioning" status before "isGracefullyDecommissionableNode" in method "isNodeInDecommissioning"? Because the "gracefulDecommissionableNodes" set will only be cleared at the next refresh operation, the check will always be executed, which seems unnecessary.
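For context on question 1, a minimal sketch of the two-snapshot pattern under discussion, assuming the point of the lazy load is that readers keep seeing the old snapshot until the refresh is finished. This is an illustration, not the actual {{HostsFileReader}} code:
{code:java}
// Two-snapshot sketch: heartbeats keep reading "current" until
// finishRefresh() publishes the lazily loaded snapshot. Illustrative only.
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class LazyHostsReader {
  private volatile Set<String> current = Collections.emptySet();
  private volatile Set<String> lazyLoaded = Collections.emptySet();

  // Step 1: parse the new exclude list, but do not make it visible yet.
  void lazyRefresh(Set<String> newExcludes) {
    lazyLoaded = Collections.unmodifiableSet(new HashSet<>(newExcludes));
  }

  // Step 2: after the graceful-decommission bookkeeping is done, publish it.
  void finishRefresh() {
    current = lazyLoaded;
  }

  Set<String> getHostDetails() {            // what the heartbeat path sees
    return current;
  }

  Set<String> getLazyLoadedHostDetails() {  // what the refresh path works on
    return lazyLoaded;
  }
}
{code}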
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961354#comment-16961354 ] Hadoop QA commented on YARN-9011: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 9s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 24s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 5s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 27s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 27s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 34s{color} | {color:orange} root: The patch generated 1 new + 41 unchanged - 0 fixed = 42 total (was 41) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 3s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 58s{color} | {color:green} hadoop-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 81m 51s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 50s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}197m 3s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984177/YARN-9011-007.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 322f3f8df33c 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d5e9971 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | checkstyle |
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961133#comment-16961133 ] Peter Bacsko commented on YARN-9011: Thanks [~snemeth] for the comments - I uploaded patch v7 with the corrections. I realized that there are still TODOs in the patch; I will address those in the next version.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961051#comment-16961051 ] Szilard Nemeth commented on YARN-9011: Hi [~pbacsko]! The PoC patch looks good to me as far as the approach goes. Some comments: 1. In {{org.apache.hadoop.yarn.server.resourcemanager.NodesListManager#NodesListManager}}: you are using a {{ConcurrentHashSet}} imported from {{org.eclipse.jetty.util.ConcurrentHashSet}} in the constructor. This class is deprecated, as per its javadoc:
{code:java}
/**
 * @deprecated Use Java 8 method {@code ConcurrentHashMap.newKeySet()} instead.
 */
{code}
2. The code you modified in {{NodesListManager}}:
{code:java}
public boolean isGracefullyDecommissionableNode(RMNode node) {
  return node == null ? false : gracefulDecommissionableNodes.contains(node);
}
{code}
This condition could be a bit simpler:
{code:java}
node != null && gracefulDecommissionableNodes.contains(node);
{code}
Otherwise, I think you can go ahead and remove the TODOs and add some tests, as the PoC is good! Thanks!
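For reference, a small snippet showing both suggestions together ({{ConcurrentHashMap.newKeySet()}} instead of the deprecated Jetty set, and the simplified condition). This is an illustrative sketch, not the actual {{NodesListManager}} code:
{code:java}
// Sketch of the two review suggestions above; illustrative snippet only.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class GracefulNodes {
  // Replaces the deprecated org.eclipse.jetty.util.ConcurrentHashSet.
  private final Set<String> gracefulDecommissionableNodes =
      ConcurrentHashMap.newKeySet();

  // Simplified condition: no ternary needed.
  boolean isGracefullyDecommissionableNode(String node) {
    return node != null && gracefulDecommissionableNodes.contains(node);
  }
}
{code}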
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937879#comment-16937879 ] Hadoop QA commented on YARN-9011: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 23s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 8s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 48s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 59s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 23s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 31s{color} | {color:orange} root: The patch generated 3 new + 31 unchanged - 0 fixed = 34 total (was 31) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 29s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 7s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 48s{color} | {color:red} hadoop-common in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 90m 1s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}205m 54s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.ha.TestZKFailoverController | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:efed4450bf1 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12981320/YARN-9011-006.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux ab4189af6041 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision |
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937707#comment-16937707 ] Peter Bacsko commented on YARN-9011: ping [~bibinchundatt], [~adam.antal], [~tangzhankun]
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937696#comment-16937696 ] Peter Bacsko commented on YARN-9011: So I uploaded patch v6, which contains a completely new approach (without new tests so far). Here is the idea:
1. If graceful decommissioning was initiated, {{HostsFileReader}} loads the new XMLs _lazily_, meaning that "current" is not changed.
2. We process the included/excluded hosts. Excluded hosts are added to an internal set called {{gracefulDecommissionableNodes}} (which is cleared at each refresh).
3. Once that is done, we call {{HostsFileReader.finishRefresh()}}, publishing the new settings.
By doing so, decommissionable hosts can be retrieved from {{NodesListManager}}, so there's no inconsistency. There's a trick though: we still have to wait until {{RMNode}} leaves either the RUNNING or the UNHEALTHY state (see the changes in {{isNodeInDecommissioning()}}). Once this has happened, we no longer need the set of {{RMNode}} instances inside {{NodesListManager}} and can rely on {{RMNode.getState()}}. Please weigh in on this proposal and whether it is acceptable or not.
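To summarize the three steps, here is a self-contained sketch of the flow; all names are simplified stand-ins, not the real {{NodesListManager}}/{{HostsFileReader}} implementation:
{code:java}
// Self-contained sketch of the three-step flow described above; hypothetical
// names, for illustration only.
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class GracefulRefreshSketch {
  private volatile Set<String> publishedExcludes = new HashSet<>();
  private Set<String> pendingExcludes = new HashSet<>();
  private final Set<String> gracefulDecommissionableNodes =
      ConcurrentHashMap.newKeySet();

  // Step 1: load the new exclude list lazily; the published view is unchanged.
  void lazyRefresh(Set<String> newExcludes) {
    pendingExcludes = new HashSet<>(newExcludes);
  }

  // Step 2: remember which excluded hosts are being decommissioned gracefully
  // (the set is cleared at each refresh, as in the description above).
  void recordGracefulNodes() {
    gracefulDecommissionableNodes.clear();
    gracefulDecommissionableNodes.addAll(pendingExcludes);
  }

  // Step 3: publish the new settings; heartbeats now see the exclusion, and
  // the graceful set already marks these hosts as decommissionable.
  void finishRefresh() {
    publishedExcludes = pendingExcludes;
  }

  boolean isExcluded(String host) {
    return publishedExcludes.contains(host);
  }

  boolean isGracefullyDecommissionable(String host) {
    return gracefulDecommissionableNodes.contains(host);
  }
}
{code}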
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936907#comment-16936907 ] Hadoop QA commented on YARN-9011: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 39s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 14 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 44s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 32s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 13s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 37s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 1 new + 562 unchanged - 3 fixed = 563 total (was 565) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 45s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 5 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 45s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 41s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 45s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 39s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}217m 31s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerOvercommit | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:efed4450bf1 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12981167/YARN-9011-005.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 86c5bb1bd80f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality |
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936867#comment-16936867 ] Peter Bacsko commented on YARN-9011: "In large clusters we can't expect that to be millis." - if that's the case, then I agree that waiting for such a transition can be costly and might cause a bottleneck. I'll think about a fully synchronous approach. As you mentioned above, this means that we have to maintain a list of decommissionable nodes.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936842#comment-16936842 ] Bibin A Chundatt commented on YARN-9011: {quote} But even if you have to wait, it's a very small window which is probably just milliseconds {quote} Depends on the time taken to process events. In large clusters we can't expect that to be millis.
*Alternate approach*
NodesListManager is the source of the *GRACEFUL_DECOMMISSION* event, based on which the state transition of RMNodeImpl to DECOMMISSIONING happens. I think, as per YARN-3212, that state prevents the containers from getting killed during the DECOMMISSIONING period.
* We could maintain in NodesListManager the list of to-be-decommissioned nodes for which the *GRACEFUL_DECOMMISSION* event was fired.
* HostsFileReader sets the refreshed *HostDetails* only after the event is fired. This way the HostsFileReader and the node state stay in sync.
Thoughts?
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936729#comment-16936729 ] Peter Bacsko commented on YARN-9011: [~bibinchundatt] thanks for the insights. There's no impact on the functionality, everything is supposed to work the same way as before.
"Any wait on ResourceTrackerService is costly, since we have limited number of handlers for ResourceTrackerService (consider 10k+ nodes and 100/200 handlers each nm heartbeating at 1 sec interval)"
That's good to know. But right here we have a special case. It only applies when someone modifies the hosts file and selects nodes for graceful decommissioning. In this case, there's a race between ResourceTrackerService and the selected RMNodeImpl instances. But even if you have to wait, it's a very small window which is probably just milliseconds. In every other case, we don't wait at all. The normal code path is not affected.
The only possible caveat here is that {{NodesListManager.isValidNode()}} is now guarded by a lock (see the sketch below). However, I'm no longer sure that it's really necessary; I have to think it over.
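To make the locking remark concrete, here is a tiny sketch (assumed structure, not the actual NodesListManager code) of what guarding {{isValidNode()}} with a lock means: a caller either sees the old host lists or the fully refreshed ones, never a half-written state.
{code:java}
import java.util.HashSet;
import java.util.Set;

/** Sketch: both the read path and the refresh path take the same lock. */
public class GuardedNodesListSketch {
  private final Object hostsListLock = new Object();
  private final Set<String> excludedHosts = new HashSet<>();

  public boolean isValidNode(String hostName) {
    synchronized (hostsListLock) {
      return !excludedHosts.contains(hostName);
    }
  }

  public void refreshExcludes(Set<String> newExcludes) {
    synchronized (hostsListLock) {
      excludedHosts.clear();
      excludedHosts.addAll(newExcludes);
    }
  }
}
{code}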
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936710#comment-16936710 ] Bibin A Chundatt commented on YARN-9011: [~pbacsko] Just glanced through the discussion points. Could you explain the race in detail? What is the major impact in terms of functionality?
{quote} ResourceTrackerService uses NodesListManager to determine what nodes are enabled. But sometimes it sees an inconsistent state: NodesListManager returns that a certain node is in the excluded list, but its state is not DECOMMISSIONING. *So we have to wait for this state change.* {quote}
*Points to consider*
* Any wait in ResourceTrackerService is costly, since we have a limited number of handlers for ResourceTrackerService (consider 10k+ nodes and 100/200 handlers, each NM heartbeating at a 1-second interval), as the rough calculation below illustrates.
* As per the current implementation, the ResourceTrackerService handler will wait till the state is changed, right?
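A back-of-envelope check of the numbers above: 10,000 NodeManagers heartbeating once per second means roughly 10,000 ResourceTrackerService RPCs per second; spread across 100-200 handler threads, that is about 50-100 requests per handler per second, so each handler has only around 10-20 ms per request before the RPC queue starts to back up. Even a short blocking wait inside a handler can therefore stall registrations and heartbeats cluster-wide.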
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936651#comment-16936651 ] Peter Bacsko commented on YARN-9011: Thanks for the comments [~adam.antal]. Regarding the property, I'm not sure that this is something that needs to be user-facing. It's a very internal thing, and 10 seconds is already a really big time window. If that's not enough, it means that the machine which runs the ResourceManager is heavily overloaded and probably unable to carry out its duties properly.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936635#comment-16936635 ] Zhankun Tang commented on YARN-9011: [~pbacsko], I see. I may be missing something important. What about adding one more check condition? Would it be too ugly? Just simple pseudo-code:
{code:java}
if !isValidNode && (!isNodeDecommissioning && !isNodeRunning)
{code}
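For illustration, here is a hedged expansion of that pseudo-code into plain Java, using stub methods instead of the real RM internals; {{isNodeRunning}} in particular is a hypothetical helper, not an existing API.
{code:java}
/** Sketch: shut the node down only if it is excluded AND neither decommissioning nor still running. */
public class ExtraCheckSketch {

  enum NodeState { RUNNING, DECOMMISSIONING, OTHER }

  // Stub values: the node is on the exclude list but its state machine still says RUNNING.
  boolean isValidNode(String host) { return false; }
  NodeState getState(String host) { return NodeState.RUNNING; }

  boolean shouldShutDown(String host) {
    NodeState state = getState(host);
    return !isValidNode(host)
        && state != NodeState.DECOMMISSIONING
        && state != NodeState.RUNNING;
  }

  public static void main(String[] args) {
    // Prints "false": during the race window the excluded-but-still-RUNNING node is not shut down.
    System.out.println(new ExtraCheckSketch().shouldShutDown("node-6.hostname.com"));
  }
}
{code}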
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936633#comment-16936633 ] Adam Antal commented on YARN-9011: Thank you for your patch [~pbacsko]. Overall I agree with your resolution, so I'm fine with the main idea behind this patch. Some things to consider / improve:
- Structurally, I'd like to propose the following modification in {{DecommissioningNodesSyncer}}. I think a new config, e.g. "yarn.resourcemanager.decommission-syncer.wait-ms", would come in handy for this class instead of specifying the timeout for each call of {{DecommissioningNodesSyncer$awaitDecommissiningStatus}}. This config can be determined in a new {{init()}} function or in the constructor - the former is the standard in Hadoop, though there are examples of the latter. Also, the latter would cause less churn in the other pieces of code you touched, while an {{init()}} function would cause more, but it's your choice. I think the default (1 ms) is sufficient; you can also add that to {{YarnConfiguration}}. (See the sketch after this list for what this could look like.)
- In {{DecommissioningNodesSyncer}} you use {{ConcurrentLinkedDeque}} as a data structure. As far as I can see, the used functions are {{add}}, {{remove}} and {{contains}}. That would make me think a {{ConcurrentHashSet}} (of Guava, probably) would be a more suitable choice.
- In {{NodeListManager}} I prefer using a lock instead of grabbing an object and using that for synchronisation.
- I would ask two more people to take a look at this, but I think the synchronisation is proper. Optionally, I'd suggest adding a reference to this new form of synchronisation to the javadoc of the {{ResourceTrackerService}} class, but I realised it did not have javadoc at all. If you're diligent, you can write it, but I assume it's out of scope for this patch.
Minor nits:
- In the javadoc of {{DecommissioningNodesSyncer}} some @links would be prettier.
- Also add some javadoc to the public functions of that class ({{awaitDecommissiningStatus}} and {{setDecomissioningStatus}} is enough).
- Could you add a meaningful message to this assert in {{TestResourceTrackerService$testWaitForDecommTransition}}?
{code:java}
assertEquals("NodeAction", NodeAction.SHUTDOWN, response.getNodeAction());
{code}
Thanks!
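A rough sketch of the first review point: resolve the timeout once in an {{init()}}-style method instead of passing it to every await call. The property name comes from the review comment above; the class name, constant names and default value shown here are placeholders, not the ones in the patch.
{code:java}
import org.apache.hadoop.conf.Configuration;

/** Sketch: read the wait timeout once and reuse it for every await call. */
public class DecommissionSyncerConfigSketch {

  public static final String DECOMM_SYNCER_WAIT_MS =
      "yarn.resourcemanager.decommission-syncer.wait-ms";
  public static final long DEFAULT_DECOMM_SYNCER_WAIT_MS = 10_000L; // placeholder default

  private long waitMs;

  public void init(Configuration conf) {
    waitMs = conf.getLong(DECOMM_SYNCER_WAIT_MS, DEFAULT_DECOMM_SYNCER_WAIT_MS);
  }

  public long getWaitMs() {
    return waitMs;
  }
}
{code}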
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936576#comment-16936576 ] Peter Bacsko commented on YARN-9011: [~tangzhankun] yes, that's correct. The problem is that {{RMNodeImpl}} does not immediately go to the {{DECOMMISSIONING}} state; you have to wait for the state machine to complete the transition on a different thread. I was thinking about alternative approaches, for example introducing a new field to {{RMNodeImpl}} that stores the graceful decommissioning intent and could be updated synchronously from {{NodesListManager}}, so it's enough to have the {{synchronized}} block (a sketch of that idea follows below). But then you have to deal with this extra variable and update it wherever necessary, so I felt that right now this approach is safer.
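For illustration only, a minimal sketch of that alternative: a flag stored on the node object, set synchronously by the refresh path, which readers could consult without waiting for the state machine. The class and method names are made up for the example and are not part of the real RMNodeImpl.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: decommission "intent" is visible immediately, before the state transition fires. */
public class NodeWithDecommissionIntentSketch {

  private final AtomicBoolean gracefulDecommissionIntended = new AtomicBoolean(false);

  /** Called synchronously from the refresh path. */
  public void markGracefulDecommissionIntent() {
    gracefulDecommissionIntended.set(true);
  }

  /** ResourceTrackerService could check this instead of waiting for DECOMMISSIONING. */
  public boolean isGracefulDecommissionIntended() {
    return gracefulDecommissionIntended.get();
  }
}
{code}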
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936329#comment-16936329 ] Zhankun Tang commented on YARN-9011: [~pbacsko], Thanks for the elaboration. Not sure if I understand this clearly. Is the unexpected "Disallowed NodeManager nodeId .." caused by the *false* value returned by "isNodeInDecommissioning(nodeId)"? If this node is not in the "decommissioning" state, what is its state now? In the middle of the transition from the RUNNING state to the DECOMMISSIONING state?
{code:java}
if (!this.nodesListManager.isValidNode(nodeId.getHost())
    && !isNodeInDecommissioning(nodeId)) {
{code}
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935731#comment-16935731 ] Peter Bacsko commented on YARN-9011: [~adam.antal] so the problem is that {{ResourceTrackerService}} uses {{NodesListManager}} to determine which nodes are healthy or not. But sometimes it sees an inconsistent state: {{NodesListManager}} returns that a certain node is in the excluded list, but its state is not {{DECOMMISSIONING}}. So we have to wait for this state change.
First, adding {{synchronized}} blocks to {{NodesListManager}} is necessary. When you call {{isValidNode()}}, you have to wait until the XML (which contains the list of to-be-decommissioned nodes) is completely processed. However, if {{isValidNode()}} returns false, you don't know whether graceful decommissioning is going on. If it is, the state of {{RMNodeImpl}} is {{NodeState.DECOMMISSIONING}}. But the catch is that the state transition happens on a separate dispatcher thread, so you have to wait for it. Most of the time it's quick enough, but you can miss it. When that happens, RTS simply considers the node to be "disallowed" and orders a shutdown immediately.
So that's why I introduced a new class called {{DecommissioningNodesSyncer}}. If a node is selected for graceful decommissioning, it is added to a deque. When {{isValidNode()}} returns false, we check if the node is included in this deque. Then we wait for the state transition with a {{Condition}} object. Signaling comes from {{RMNode}} itself (see the sketch below).
The change is a bit bigger than it should be because I modified constructors, so to avoid compilation problems, tests also had to be modified. An alternative to this is using a singleton {{DecommissioningNodesSyncer}}, but I just don't like it; I prefer dependency injection to singletons.
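A condensed, self-contained sketch of the waiting logic described above, assuming a simple set plus a {{Condition}}; the real {{DecommissioningNodesSyncer}} in the patch differs in its exact signatures and in how it is wired into {{RMNode}} and {{NodesListManager}}.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of the syncer: refresh registers the node, the RPC waits, RMNode signals. */
public class DecommissioningNodesSyncerSketch {

  private final Set<String> pendingNodes = ConcurrentHashMap.newKeySet();
  private final Lock lock = new ReentrantLock();
  private final Condition transitioned = lock.newCondition();

  /** Called from NodesListManager when a node is selected for graceful decommission. */
  public void markPendingDecommission(String nodeId) {
    pendingNodes.add(nodeId);
  }

  /** Called from the registration/heartbeat path: bounded wait for the transition. */
  public boolean awaitDecommissioningStatus(String nodeId, long timeoutMs)
      throws InterruptedException {
    long nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    lock.lock();
    try {
      while (pendingNodes.contains(nodeId)) {
        if (nanosLeft <= 0) {
          return false;                    // gave up waiting
        }
        nanosLeft = transitioned.awaitNanos(nanosLeft);
      }
      return true;                         // transition observed (or node was never pending)
    } finally {
      lock.unlock();
    }
  }

  /** Called from RMNode once it actually entered DECOMMISSIONING. */
  public void setDecommissioningStatus(String nodeId) {
    lock.lock();
    try {
      pendingNodes.remove(nodeId);
      transitioned.signalAll();
    } finally {
      lock.unlock();
    }
  }
}
{code}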
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932170#comment-16932170 ] Peter Bacsko commented on YARN-9011: [~adam.antal] absolutely, this patch needs some explanation - it might be difficult to see why all this is necessary.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932169#comment-16932169 ] Adam Antal commented on YARN-9011: Thanks for the patch [~pbacsko]. Could you please elaborate on the solution - what is the proposed solution, etc.?
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930774#comment-16930774 ] Hadoop QA commented on YARN-9011: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 36s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 14 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 51s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 42s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 7s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 14s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 1 new + 562 unchanged - 3 fixed = 563 total (was 565) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 85m 13s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 38s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 38s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}185m 55s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:39e82acc485 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12980422/YARN-9011-004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 43a4d061ddbd 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 56f042c | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | checkstyle |
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930656#comment-16930656 ] Hadoop QA commented on YARN-9011: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 56s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 14 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 18s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 47s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 18s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 58s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 19s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 3 new + 562 unchanged - 3 fixed = 565 total (was 565) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 23s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 49s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 86m 37s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 42s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 37s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}202m 53s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:39e82acc485 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12980402/YARN-9011-003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 0e055a82ebb4 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 85b1c72 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | |
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929397#comment-16929397 ] Hadoop QA commented on YARN-9011: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 19m 9s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 9 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 45s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 47s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 6s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 12s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 10 new + 343 unchanged - 0 fixed = 353 total (was 343) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 20s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 29s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 86m 22s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 42s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 40s{color} | {color:red} The patch generated 1 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}207m 5s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Exceptional return value of java.util.concurrent.locks.Condition.await(long, TimeUnit) ignored in org.apache.hadoop.yarn.server.resourcemanager.DecommissioningNodesSyncer.awaitDecommissiningStatus(RMNode, int) At DecommissioningNodesSyncer.java:TimeUnit) ignored in org.apache.hadoop.yarn.server.resourcemanager.DecommissioningNodesSyncer.awaitDecommissiningStatus(RMNode, int) At DecommissioningNodesSyncer.java:[line 35] | | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations | | |
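The findbugs -1 above flags that the boolean result of {{Condition.await(long, TimeUnit)}} is discarded in {{DecommissioningNodesSyncer.awaitDecommissiningStatus}}. Below is a minimal sketch of the waiting pattern that avoids this findbugs warning; the class, field, and method names are illustrative only and do not reproduce the code in YARN-9011-002.patch:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only (not the patch code): the remaining wait time
// returned by the timed await is consumed, so the "exceptional return value
// ignored" findbugs pattern does not trigger.
public class DecommissioningAwaitSketch {
  private final Lock lock = new ReentrantLock();
  private final Condition decommissioningReached = lock.newCondition();
  private boolean inDecommissioning = false; // hypothetical state flag, set elsewhere under the lock

  /** Waits up to the given timeout for the node to enter DECOMMISSIONING. */
  public boolean awaitDecommissioningStatus(long timeout, TimeUnit unit)
      throws InterruptedException {
    long nanosLeft = unit.toNanos(timeout);
    lock.lock();
    try {
      while (!inDecommissioning) {
        if (nanosLeft <= 0L) {
          return false; // timed out before the state change was observed
        }
        // awaitNanos returns the time still remaining; keeping it makes the
        // timeout accounting correct across spurious wakeups.
        nanosLeft = decommissioningReached.awaitNanos(nanosLeft);
      }
      return true;
    } finally {
      lock.unlock();
    }
  }
}
{code}
Whether the real patch should loop like this or simply propagate a timeout is up to the patch author; the sketch only shows how the return value can be used so the warning goes away.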
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929234#comment-16929234 ] Peter Bacsko commented on YARN-9011: Uploaded v2; it's still considered a POC.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928826#comment-16928826 ] Peter Bacsko commented on YARN-9011: Ignore patch v1; it's not enough. The proper solution is more complicated. I will update it soon.
[jira] [Commented] (YARN-9011) Race condition during decommissioning
[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928662#comment-16928662 ] Hadoop QA commented on YARN-9011: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 42s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 45s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 8s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 84m 19s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}140m 5s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.2 Server=19.03.2 Image:yetus/hadoop:f4f9f0fe4f2 | | JIRA Issue | YARN-9011 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12980192/YARN-9011-001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 56dfb8150a86 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / f4f9f0f | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24791/testReport/ | | Max. process+thread count | 813 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24791/console | | Powered by | Apache Yetus 0.8.0