Prabhu Joseph created YARN-10352:
------------------------------------
Summary: MultiNode Placement assigns containers on stopped
NodeManagers
Key: YARN-10352
URL: https://issues.apache.org/jira/browse/YARN-10352
Project: Hadoop YARN
Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
When NodeManager recovery is enabled, stopping an NM does not unregister it
from the RM. The RM therefore keeps the stopped node in its list of active
nodes until the NM liveliness monitor expires it after the configured timeout
(yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10
minutes, Multi Node Placement keeps assigning containers to the stopped node.
Multi Node Placement needs to exclude nodes that have not heartbeated within
the configured heartbeat interval
(yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar
to what the asynchronous Capacity Scheduler threads already do
(CapacityScheduler#shouldSkipNodeSchedule).
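The proposed exclusion can be sketched roughly as below. This is a
self-contained illustration, not the actual YARN code: StaleNodeFilter,
NodeInfo, shouldSkip and candidates are hypothetical names, and the threshold
of two missed heartbeat intervals is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleNodeFilter {

    /** Hypothetical stand-in for a scheduler node; not a YARN class. */
    static final class NodeInfo {
        final String name;
        final long lastHeartbeatMs; // monotonic time of the last heartbeat

        NodeInfo(String name, long lastHeartbeatMs) {
            this.name = name;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Returns true when the node has been silent for more than two heartbeat
     * intervals (assumed threshold) and should not receive new containers.
     */
    static boolean shouldSkip(NodeInfo node, long nowMs, long heartbeatIntervalMs) {
        return nowMs - node.lastHeartbeatMs > 2 * heartbeatIntervalMs;
    }

    /** Keeps only the names of nodes that heartbeated recently enough. */
    static List<String> candidates(List<NodeInfo> nodes, long nowMs,
                                   long heartbeatIntervalMs) {
        List<String> result = new ArrayList<>();
        for (NodeInfo node : nodes) {
            if (!shouldSkip(node, nowMs, heartbeatIntervalMs)) {
                result.add(node.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        List<NodeInfo> nodes = Arrays.asList(
            new NodeInfo("worker0", now - 30_000L), // stopped 30 s ago
            new NodeInfo("worker1", now - 500L));   // heartbeated 0.5 s ago
        // With a 1000 ms heartbeat interval only worker1 remains a candidate.
        System.out.println(candidates(nodes, now, 1000L)); // prints [worker1]
    }
}
```

With such a check, a stopped node would drop out of placement after roughly
two seconds of silence instead of staying eligible for the full 10 minute
expiry window.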
*Repro:*
1. Enable Multi Node Placement
(yarn.scheduler.capacity.multi-node-placement-enabled) and NodeManager
recovery (yarn.nodemanager.recovery.enabled).
2. Have only one NM running, say worker0.
3. Stop worker0 and start another NM, say worker1.
4. Submit a sleep job. Its containers time out because they are assigned to
the stopped NM worker0.