Prabhu Joseph created YARN-10352:
------------------------------------

             Summary: MultiNode Placement assigns containers on stopped 
NodeManagers
                 Key: YARN-10352
                 URL: https://issues.apache.org/jira/browse/YARN-10352
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Prabhu Joseph
            Assignee: Prabhu Joseph


When NodeManager recovery is enabled, stopping an NM does not unregister it 
from the RM. The RM therefore keeps the stopped node among its active nodes 
until the NM liveliness monitor expires it after the configured timeout 
(yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
minutes, Multi Node Placement keeps assigning containers to that node. It 
needs to exclude nodes that have not heartbeated within the configured 
heartbeat interval 
(yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar 
to what the asynchronous Capacity Scheduler threads do 
(CapacityScheduler#shouldSkipNodeSchedule).
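
The proposed exclusion can be sketched as a plain heartbeat-staleness 
filter. This is only an illustration and not the YARN code: StaleNodeFilter, 
NodeInfo, shouldSkip and candidates are hypothetical names, and the 
threshold of two missed heartbeat intervals is an assumption rather than 
something stated in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, self-contained sketch; none of these types are YARN classes.
class StaleNodeFilter {

    /** Minimal stand-in for a scheduler node: a host and its last heartbeat time. */
    static final class NodeInfo {
        final String host;
        final long lastHeartbeatMs; // monotonic timestamp of the last heartbeat

        NodeInfo(String host, long lastHeartbeatMs) {
            this.host = host;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Returns true when the node has been silent for longer than
     * missedHeartbeats * heartbeatIntervalMs (assumed threshold).
     */
    static boolean shouldSkip(NodeInfo node, long nowMs,
                              long heartbeatIntervalMs, int missedHeartbeats) {
        long elapsedMs = nowMs - node.lastHeartbeatMs;
        return elapsedMs > heartbeatIntervalMs * missedHeartbeats;
    }

    /** Keeps only the hosts that are still heartbeating, preserving input order. */
    static List<String> candidates(List<NodeInfo> nodes, long nowMs,
                                   long heartbeatIntervalMs, int missedHeartbeats) {
        List<String> live = new ArrayList<>();
        for (NodeInfo node : nodes) {
            if (!shouldSkip(node, nowMs, heartbeatIntervalMs, missedHeartbeats)) {
                live.add(node.host);
            }
        }
        return live;
    }
}
```

With a 1000 ms heartbeat interval and a threshold of two missed heartbeats, 
a node that was stopped five seconds ago is filtered out long before the 
10 minute expiry, while a node that heartbeated 500 ms ago stays a placement 
candidate.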


*Repro:*

1. Enable Multi Node Placement 
(yarn.scheduler.capacity.multi-node-placement-enabled) and NodeManager 
recovery (yarn.nodemanager.recovery.enabled).

2. Have only one NM running, say worker0.

3. Stop worker0 and start another NM, say worker1.

4. Submit a sleep job. The containers time out because they are assigned to 
the stopped NM worker0.




