Prabhu Joseph created YARN-10352: ------------------------------------ Summary: MultiNode Placament assigns container on stopped NodeManagers Key: YARN-10352 URL: https://issues.apache.org/jira/browse/YARN-10352 Project: Hadoop YARN Issue Type: Bug Reporter: Prabhu Joseph Assignee: Prabhu Joseph
When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM Active Nodes will be still having those stopped nodes until NM Liveliness Monitor Expires after configured timeout (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, Multi Node Placement assigns the containers on those nodes. They need to exclude the nodes which has not heartbeated for configured heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to Asynchronous Capacity Scheduler Threads. (CapacityScheduler#shouldSkipNodeSchedule) *Repro:* 1. Enable Multi Node Placement (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery Enabled (yarn.node.recovery.enabled) 2. Have only one NM running say worker0 3. Stop worker0 and start any other NM say worker1 4. Submit a sleep job. The containers will timeout as assigned to stopped NM worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org