[ https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029396#comment-17029396 ]
Mikayla Konst commented on YARN-9011:
-------------------------------------

We experienced this exact same race condition recently (the resource manager sending a SHUTDOWN signal to a node manager because it received a heartbeat from that node manager *after* the HostDetails reference was updated, but *before* the node was transitioned to state DECOMMISSIONING). I think this patch is a huge improvement over the previous behavior, but there is still a narrow race that can occur when refresh nodes is called multiple times in quick succession with the same set of nodes in the exclude file:

# lazy-loaded HostDetails reference is updated
# nodes are added to the gracefullyDecommissionableNodes set
# current HostDetails reference is updated
# event to update the node's status to DECOMMISSIONING is added to the asynchronous event handler's event queue, but hasn't been processed yet
# refresh nodes is called a second time
# lazy-loaded HostDetails reference is updated
# gracefullyDecommissionableNodes set is cleared
# node manager heartbeats to the resource manager. It is not in state DECOMMISSIONING and not in the gracefullyDecommissionableNodes set, but it is an excluded node in the HostDetails, so it is sent a SHUTDOWN signal
# node is added to the gracefullyDecommissionableNodes set
# event handler transitions the node to state DECOMMISSIONING at some point

This would be fixed if you used an AtomicReference for your set of gracefullyDecommissionableNodes and swapped out the reference, similar to how you handled the HostDetails.

Alternatively, instead of using an asynchronous event handler to update the state of the nodes to DECOMMISSIONING, you could update the state synchronously: grab a lock, update HostDetails and synchronously update the states of the nodes being gracefully decommissioned, then release the lock.
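The first suggestion (the AtomicReference swap) could look roughly like the sketch below. This is a minimal illustration under assumed names — GracefulDecommissionSets, refreshNodes, isGracefullyDecommissionable are all hypothetical, not the actual NodesListManager fields or methods:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the set of gracefully decommissionable nodes lives
// behind an AtomicReference, and a refresh builds a complete replacement set
// before swapping it in. A concurrent heartbeat therefore observes either the
// old complete set or the new complete set, never a set that has been cleared
// but not yet repopulated.
public class GracefulDecommissionSets {

    private final AtomicReference<Set<String>> gracefullyDecommissionableNodes =
            new AtomicReference<>(Collections.emptySet());

    // Refresh path: build the new set fully, then publish it in one atomic swap.
    public void refreshNodes(Set<String> excludedHosts) {
        Set<String> replacement =
                Collections.unmodifiableSet(new HashSet<>(excludedHosts));
        gracefullyDecommissionableNodes.set(replacement);
    }

    // Heartbeat path: a node that is excluded in HostDetails but present in
    // this set is being gracefully decommissioned, so it must not be sent a
    // SHUTDOWN signal.
    public boolean isGracefullyDecommissionable(String host) {
        return gracefullyDecommissionableNodes.get().contains(host);
    }
}
```

With this shape, calling refresh nodes twice in a row with the same exclude list never opens a window in which the set is empty: a heartbeat arriving between the two calls still finds the node in the previously published set.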
When the resource tracker service receives a heartbeat and needs to check whether a node should be shut down (i.e., it is excluded and in state DECOMMISSIONING), it would grab the lock right before doing the check. Having the resource tracker service wait on a lock doesn't sound great, but the wait would likely be on the order of milliseconds, and only when refresh nodes is called.

> Race condition during decommissioning
> -------------------------------------
>
>                 Key: YARN-9011
>                 URL: https://issues.apache.org/jira/browse/YARN-9011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.1
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-9011-001.patch, YARN-9011-002.patch, YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, YARN-9011-006.patch, YARN-9011-007.patch, YARN-9011-008.patch, YARN-9011-009.patch, YARN-9011-branch-3.1.001.patch, YARN-9011-branch-3.2.001.patch
>
>
> During internal testing, we found a nasty race condition which occurs during decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
> 2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
> 2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=172.26.22.115 OPERATION=refreshNodes TARGET=AdminService RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability: <memory:8192, vCores:8>
> 2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from {{ResourceTrackerService}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)