Peter Bacsko created YARN-9011:
----------------------------------

             Summary: Race condition during decommissioning
                 Key: YARN-9011
                 URL: https://issues.apache.org/jira/browse/YARN-9011
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.1.1
            Reporter: Peter Bacsko
            Assignee: Antal Bálint Steinbach


During internal testing, we found a nasty race condition that occurs while a 
node is being decommissioned.

NodeManager, incorrect behaviour:
{noformat}
2018-06-18 21:00:17,634 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:00:17,634 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
hostname:node-6.hostname.com
{noformat}

NodeManager, expected behaviour:
{noformat}
2018-06-18 21:07:37,377 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:07:37,377 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
decommissioned
{noformat}

Note the two different messages from the RM ("Disallowed NodeManager" vs 
"DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
inconsistent state of nodes while they're being updated:

{noformat}
2018-06-18 21:00:17,575 INFO 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
 exclude:{node-6.hostname.com}
2018-06-18 21:00:17,575 INFO 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
decommission node node-6.hostname.com:8041 with state RUNNING
2018-06-18 21:00:17,575 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
node-6.hostname.com
2018-06-18 21:00:17,576 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
node-6.hostname.com:8041 in DECOMMISSIONING.
2018-06-18 21:00:17,575 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn     
IP=172.26.22.115        OPERATION=refreshNodes  TARGET=AdminService     
RESULT=SUCCESS
2018-06-18 21:00:17,577 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
original total capability: <memory:8192, vCores:8>
2018-06-18 21:00:17,577 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
{noformat}

When decommissioning succeeds, {{ResourceTrackerService}} logs nothing for the 
node.
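
To make the interleaving in the RM log above concrete, here is a minimal, deterministic Java sketch. It is a simplified model and not the actual Hadoop code: the class {{DecommissionRaceSketch}}, the helper {{ResourceManagerModel}} and all of its method names are hypothetical. It assumes, based only on the log timestamps (exclude list read at ,575, state transition at ,576/,577), that the refreshed exclude list becomes visible before the node is put into DECOMMISSIONING, and that a heartbeat can be processed in between.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DecommissionRaceSketch {

    enum NodeState { RUNNING, DECOMMISSIONING }

    // Hypothetical stand-in for the RM's view of the exclude list and node states.
    static final class ResourceManagerModel {
        private final Set<String> excludedHosts = ConcurrentHashMap.newKeySet();
        private final Map<String, NodeState> nodeStates = new ConcurrentHashMap<>();

        void register(String host) {
            nodeStates.put(host, NodeState.RUNNING);
        }

        // Assumed first half of refreshNodes: the new exclude list becomes visible.
        void publishExcludeList(String host) {
            excludedHosts.add(host);
        }

        // Assumed second half: the node state transition is applied slightly later.
        void applyDecommissioningTransition(String host) {
            nodeStates.put(host, NodeState.DECOMMISSIONING);
        }

        // Simplified heartbeat check: an excluded host that is not yet
        // DECOMMISSIONING is rejected outright.
        String heartbeat(String host) {
            boolean excluded = excludedHosts.contains(host);
            boolean decommissioning = nodeStates.get(host) == NodeState.DECOMMISSIONING;
            if (excluded && !decommissioning) {
                return "Disallowed NodeManager";
            }
            if (decommissioning) {
                return "DECOMMISSIONING";
            }
            return "OK";
        }
    }

    // Heartbeat lands between the two halves of the update (the failing case).
    static String racyInterleaving() {
        ResourceManagerModel rm = new ResourceManagerModel();
        rm.register("node-6.hostname.com");
        rm.publishExcludeList("node-6.hostname.com");
        String response = rm.heartbeat("node-6.hostname.com");
        rm.applyDecommissioningTransition("node-6.hostname.com");
        return response;
    }

    // Heartbeat lands after both halves of the update (the expected case).
    static String expectedInterleaving() {
        ResourceManagerModel rm = new ResourceManagerModel();
        rm.register("node-6.hostname.com");
        rm.publishExcludeList("node-6.hostname.com");
        rm.applyDecommissioningTransition("node-6.hostname.com");
        return rm.heartbeat("node-6.hostname.com");
    }

    public static void main(String[] args) {
        System.out.println("racy:     " + racyInterleaving());
        System.out.println("expected: " + expectedInterleaving());
    }
}
```

In this model the only difference between the two outcomes is whether the heartbeat is evaluated before or after the state transition, which matches the two different messages the NodeManager received. The model says nothing about how the window should be closed in the real RM.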



