Peter Bacsko created YARN-9011:
----------------------------------

             Summary: Race condition during decommissioning
                 Key: YARN-9011
                 URL: https://issues.apache.org/jira/browse/YARN-9011
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.1.1
            Reporter: Peter Bacsko
            Assignee: Antal Bálint Steinbach

During internal testing, we found a nasty race condition which occurs during decommissioning.

Node manager, incorrect behaviour:
{noformat}
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
{noformat}

Node manager, expected behaviour:
{noformat}
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be decommissioned
{noformat}

Note the two different messages from the RM ("Disallowed NodeManager" vs "DECOMMISSIONING").

The problem is that {{ResourceTrackerService}} can see an inconsistent state of nodes while they're being updated:
{noformat}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=172.26.22.115 OPERATION=refreshNodes TARGET=AdminService RESULT=SUCCESS
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability: <memory:8192, vCores:8>
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
{noformat}

When the decommissioning succeeds, there is no output logged from {{ResourceTrackerService}}.
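
To make the suspected window easier to reason about, here is a minimal, deterministic sketch of the interleaving the RM log above suggests: the exclude list is already updated, but the node has not yet been moved to DECOMMISSIONING when a heartbeat is handled. This is a toy model, not the actual Hadoop code. The class {{DecommissionRaceSketch}}, its methods, and the returned strings are all hypothetical names chosen for illustration, and the two-step refresh is an assumption inferred from the log ordering.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: hypothetical names, not the real ResourceManager classes.
final class DecommissionRaceSketch {
    enum NodeState { RUNNING, DECOMMISSIONING }

    private final Set<String> excludedHosts = ConcurrentHashMap.newKeySet();
    private final Map<String, NodeState> nodeStates = new ConcurrentHashMap<>();

    void register(String host) {
        nodeStates.put(host, NodeState.RUNNING);
    }

    // Assumed step 1 of a refresh: the exclude list becomes visible first.
    void updateExcludeList(String host) {
        excludedHosts.add(host);
    }

    // Assumed step 2 of a refresh: the node state catches up afterwards.
    void startDecommissioning(String host) {
        nodeStates.put(host, NodeState.DECOMMISSIONING);
    }

    // Heartbeat handler that reads both structures without coordinating
    // with the refresh, so it can observe step 1 without step 2.
    String heartbeat(String host) {
        boolean excluded = excludedHosts.contains(host);
        boolean decommissioning = nodeStates.get(host) == NodeState.DECOMMISSIONING;
        if (excluded && !decommissioning) {
            return "Disallowed NodeManager"; // the incorrect outcome
        }
        if (decommissioning) {
            return "DECOMMISSIONING";        // the expected outcome
        }
        return "NORMAL";
    }

    public static void main(String[] args) {
        DecommissionRaceSketch rm = new DecommissionRaceSketch();
        String host = "node-6.hostname.com";
        rm.register(host);

        rm.updateExcludeList(host);
        // Heartbeat lands inside the window between the two steps.
        System.out.println(rm.heartbeat(host)); // prints "Disallowed NodeManager"

        rm.startDecommissioning(host);
        // Heartbeat lands after the node state has been updated.
        System.out.println(rm.heartbeat(host)); // prints "DECOMMISSIONING"
    }
}
```

In this toy model the wrong message only appears when the heartbeat is handled between the two steps, which matches the ordering of the {{NodesListManager}}, {{ResourceTrackerService}} and {{RMNodeImpl}} lines in the log. Whether the real code path has exactly this shape still needs to be confirmed against the source.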