Andrew Chung created YARN-11002:
-----------------------------------
Summary: Resource-aware Opportunistic Container Allocation
Key: YARN-11002
URL: https://issues.apache.org/jira/browse/YARN-11002
Project: Hadoop YARN
Issue Type: New Feature
Components: container-queuing, resourcemanager
Reporter: Andrew Chung
Assignee: Andrew Chung
Currently, centralized opportunistic container (OContainer) allocation considers
only the queue length on nodes.
However, based on production experience, we have found that relying only on
queue length when allocating OContainers can lead to very high queue times.
Namely, while we rely on queue length to sort the candidate nodes for
OContainer allocation, the queue length used in that computation can be stale,
depending on the NM heartbeat frequency, leading to excessive queueing on
nodes. This excessive queueing can cause queued vertices on the critical path
of our jobs to block the rest of the job's vertices. Here's an example
illustrating what can happen:
Suppose an NM can run 3 containers, and has no allocated containers to start
with.
AM -> RM: heartbeat, allocate 6 containers
RM.NodeQueueLoadMonitor: nothing allocated on NM now, mark 3 containers on NM
as allocated to AM, 3 containers from AM outstanding.
RM -> AM: allocate 3 containers on NM
NM -> RM: heartbeat with no containers allocated
RM.NodeQueueLoadMonitor: mark 0 containers allocated on NM
AM -> NM: run 3 containers
NM -> AM: ACK, run 3 containers, no more capacity on NM
AM -> RM: heartbeat
RM.NodeQueueLoadMonitor: nothing allocated on NM according to its last
heartbeat, so mark 3 more containers on NM as allocated to AM (in reality,
3 containers are already running on NM, so the 3 marked in this heartbeat
will be queued); no containers from AM outstanding.
RM -> AM: allocate the 3 remaining containers on NM
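To make the staleness concrete, below is a minimal, self-contained Java sketch of the sequence above (the class and method names are illustrative stand-ins, not the actual NodeQueueLoadMonitor implementation): the RM's per-node allocation count is refreshed only on NM heartbeats, so two AM heartbeats that land around a stale NM heartbeat both see a seemingly empty node, and the second batch of 3 containers ends up queued.
{code:java}
// Simplified model of the race described above; names are illustrative only.
public class StaleQueueLengthDemo {

    /** RM-side view of a node, refreshed only on NM heartbeats. */
    static class RmNodeView {
        int capacity = 3;          // containers the NM can run
        int allocatedSeenByRm = 0; // last count the RM believes is on the NM

        // AM heartbeat: RM allocates against its (possibly stale) view.
        int allocate(int requested) {
            int granted = Math.min(requested, capacity - allocatedSeenByRm);
            allocatedSeenByRm += granted; // optimistic bookkeeping until next NM HB
            return granted;
        }

        // NM heartbeat: RM overwrites its view with the NM-reported count.
        void nmHeartbeat(int runningOnNm) {
            allocatedSeenByRm = runningOnNm;
        }
    }

    public static void main(String[] args) {
        RmNodeView node = new RmNodeView();

        int first = node.allocate(6);  // AM HB #1: node looks empty, RM grants 3
        node.nmHeartbeat(0);           // NM HB arrives before those 3 actually start
        int second = node.allocate(3); // AM HB #2: node still looks empty, 3 more granted

        // The NM can run only 3, so the second batch of 3 must queue behind the first.
        System.out.println("granted=" + (first + second)
                + " capacity=" + node.capacity
                + " queued=" + Math.max(0, first + second - node.capacity));
    }
}
{code}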
We've found that, if OContainers are unlucky, they can be co-located with
vertices that run for an excessively long time, forcing them to wait a long
time before they can begin to run. This has caused certain jobs to experience
a > 3x increase in median run time. In these cases, it is better for the
OContainer to remain unallocated at the RM until NMs report available
resources.
To address this, we propose a new allocation policy,
{{QUEUE_LENGTH_THEN_RESOURCES}}, complementary to the original {{QUEUE_LENGTH}}
policy, that also considers the nodes' available resources when allocating
OContainers.
It first sorts the candidate nodes by queue length in ascending order, then by
available resources.
In the case where the maximum queue length on nodes is zero, it simply waits
until a node reports available resources to run OContainers.
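Below is a minimal sketch of the proposed ordering, assuming a simplified node snapshot type (the names NodeSnapshot, queueLength, and availableMemoryMB are illustrative stand-ins, not the actual NodeQueueLoadMonitor types): candidate nodes are compared by queue length in ascending order, with ties broken by available resources in descending order, so idle, well-provisioned nodes are preferred.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class QueueLengthThenResourcesDemo {

    /** Simplified, illustrative view of a candidate node. */
    static class NodeSnapshot {
        final String nodeId;
        final int queueLength;        // OContainers currently queued on the NM
        final long availableMemoryMB; // stand-in for the node's available resources

        NodeSnapshot(String nodeId, int queueLength, long availableMemoryMB) {
            this.nodeId = nodeId;
            this.queueLength = queueLength;
            this.availableMemoryMB = availableMemoryMB;
        }
    }

    // Proposed ordering: queue length ascending, then available resources descending.
    static final Comparator<NodeSnapshot> QUEUE_LENGTH_THEN_RESOURCES =
            Comparator.comparingInt((NodeSnapshot n) -> n.queueLength)
                    .thenComparing(Comparator
                            .comparingLong((NodeSnapshot n) -> n.availableMemoryMB)
                            .reversed());

    public static void main(String[] args) {
        List<NodeSnapshot> candidates = new ArrayList<>(List.of(
                new NodeSnapshot("nm1", 2, 8192),
                new NodeSnapshot("nm2", 0, 2048),
                new NodeSnapshot("nm3", 0, 16384)));

        candidates.sort(QUEUE_LENGTH_THEN_RESOURCES);

        // Expected order: nm3 (queue 0, most resources), nm2 (queue 0), nm1 (queue 2).
        candidates.forEach(n -> System.out.println(
                n.nodeId + " queueLength=" + n.queueLength
                        + " availableMB=" + n.availableMemoryMB));
    }
}
{code}
The real policy would compare full Resource objects (memory and vcores) rather than a single memory value, but the two-level ordering is the same.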