[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161615#comment-17161615 ]
Wangda Tan commented on YARN-10352:
-----------------------------------
[~prabhujoseph], I'm trying to understand this logic: why do we have two separate
filters for outdated nodes? There is one in MultiNodeSortingManager and one
in getNodesHeartbeated. I'm wondering whether both are necessary.
> Skip schedule on not heartbeated nodes in Multi Node Placement
> --------------------------------------------------------------
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0, 3.4.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
> Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch,
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping an NM does not unregister it from the
> RM. The RM's active nodes therefore still include the stopped node until the
> NM Liveliness Monitor expires it after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins,
> Multi Node Placement keeps assigning containers to that node. It needs to
> exclude nodes that have not heartbeated within the configured heartbeat
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms),
> similar to the Asynchronous Capacity Scheduler threads
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery
> Enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start another NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned
> to the stopped NM worker0.
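
For readers following the description above, the kind of staleness check it asks for can be sketched roughly as follows. This is only an illustration and not the attached patch: the class HeartbeatFilter, its methods, and the choice of threshold are hypothetical, and the sketch uses plain maps of timestamps instead of the real scheduler node types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the YARN code base. */
final class HeartbeatFilter {
    private HeartbeatFilter() {}

    /** True when the last heartbeat is older than the allowed age (assumed threshold). */
    static boolean isStale(long nowMs, long lastHeartbeatMs, long maxAgeMs) {
        return nowMs - lastHeartbeatMs > maxAgeMs;
    }

    /** Returns the names of nodes that heartbeated recently enough, in map order. */
    static List<String> freshNodes(Map<String, Long> lastHeartbeatMs,
                                   long nowMs, long maxAgeMs) {
        List<String> fresh = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMs.entrySet()) {
            if (!isStale(nowMs, entry.getValue(), maxAgeMs)) {
                fresh.add(entry.getKey());
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        // worker0 was stopped long ago; worker1 heartbeated 500 ms ago.
        Map<String, Long> lastHeartbeat = new LinkedHashMap<>();
        lastHeartbeat.put("worker0", 0L);
        lastHeartbeat.put("worker1", 9_500L);

        // Assumed allowed age of 2000 ms, i.e. a small multiple of a 1000 ms interval.
        System.out.println(freshNodes(lastHeartbeat, 10_000L, 2_000L)); // prints [worker1]
    }
}
```

With the repro above, a filter of this shape would drop worker0 from the candidate list well before the 10 min liveliness expiry, so containers would only be placed on worker1.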