[
https://issues.apache.org/jira/browse/YARN-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Chung updated YARN-11002:
--------------------------------
Priority: Minor (was: Major)
> Resource-aware Opportunistic Container Allocation
> -------------------------------------------------
>
> Key: YARN-11002
> URL: https://issues.apache.org/jira/browse/YARN-11002
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: container-queuing, resourcemanager
> Reporter: Andrew Chung
> Assignee: Andrew Chung
> Priority: Minor
>
> Currently, centralized opportunistic container (OContainer) allocation only
> considers the queue length on each node.
> However, based on production experience, we found that relying solely on
> queue length when allocating OContainers can lead to very high queue times.
> Namely, while candidate nodes are sorted by queue length for OContainer
> allocation, the queue length used in the computation can be stale/outdated,
> depending on the NM heartbeat frequency, leading to an exceedingly high
> number of containers queued on nodes. This excessive queueing can leave
> queued vertices on the critical path of our jobs blocking the rest of the
> job's vertices. Here's an example illustrating what can happen:
> Suppose an NM can run 3 containers, and has no allocated containers to start
> with.
> AM -> RM: heartbeat, allocate 6 containers
> RM.NodeQueueLoadMonitor: nothing allocated on NM now, mark 3 containers on NM
> as allocated to AM, 3 containers from AM outstanding.
> RM -> AM: allocate 3 containers on NM
> NM -> RM: heartbeat with no containers allocated
> RM.NodeQueueLoadMonitor: mark 0 containers allocated on NM
> AM -> NM: run 3 containers
> NM -> AM: ACK, run 3 containers, no more capacity on NM
> AM -> RM: heartbeat
> RM.NodeQueueLoadMonitor: nothing allocated on NM now, mark 3 more containers
> on NM as allocated to AM (in reality, there should be 3 containers already
> allocated on NM, and the 3 marked in this heartbeat will be queued), no
> containers from AM outstanding.
> RM -> AM: allocate the 3 remaining containers on NM
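> The race above can be sketched with a toy model (hypothetical names, not
> actual YARN code): the RM keeps a per-node estimate of allocated containers
> that each NM heartbeat overwrites, so an NM heartbeat arriving before the
> granted containers actually start resets the estimate and lets the RM
> double-book the node.

```java
// Toy model of the stale-estimate race (hypothetical names, not YARN code).
public class StaleEstimateDemo {
    static final int NM_CAPACITY = 3;
    static int rmEstimate = 0; // RM-side estimate of containers allocated on the NM
    static int nmRunning = 0;  // containers actually running on the NM
    static int nmQueued = 0;   // containers queued on the NM waiting for capacity

    // RM grants up to the NM's capacity based on its (possibly stale) estimate
    static int amHeartbeat(int requested) {
        int granted = Math.min(requested, NM_CAPACITY - rmEstimate);
        rmEstimate += granted;
        return granted;
    }

    // NM reports its true allocation; the RM overwrites its estimate
    static void nmHeartbeat() {
        rmEstimate = nmRunning;
    }

    // AM launches containers on the NM; anything over capacity is queued
    static void amStartsContainers(int n) {
        int started = Math.min(n, NM_CAPACITY - nmRunning);
        nmRunning += started;
        nmQueued += n - started;
    }

    public static void main(String[] args) {
        int first = amHeartbeat(6);  // RM grants 3 (estimate 0 -> 3)
        nmHeartbeat();               // NM has started nothing yet: estimate reset to 0
        amStartsContainers(first);   // NM now runs 3, at capacity
        int second = amHeartbeat(3); // stale estimate of 0: RM grants 3 more
        amStartsContainers(second);  // no capacity left: all 3 are queued
        System.out.println("running=" + nmRunning + " queued=" + nmQueued);
    }
}
```

> Running the sketch ends with 3 containers running and 3 queued on a node
> whose capacity is 3, mirroring the trace above.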
> We've found that, if OContainers are unlucky, they can be co-located with
> vertices that run for an excessively long time, forcing them to wait a long
> time before they can begin to run. This led certain jobs to experience a
> more than 3x increase in median run time. In these cases, it is better for
> the OContainer to remain unreleased at the RM until NMs report available
> resources.
> To address this, we propose a new allocation policy,
> {{QUEUE_LENGTH_THEN_RESOURCES}}, alongside the original {{QUEUE_LENGTH}}
> policy, that considers the available resources on nodes when allocating
> OContainers.
> It first sorts the candidate nodes by queue length in ascending order, then
> breaks ties by available resources.
> Akin to the {{QUEUE_LENGTH}} policy, the new {{QUEUE_LENGTH_THEN_RESOURCES}}
> policy still picks a random node out of the top K candidates (with K
> configurable) for allocation.
> In the case where the maximum queue length across nodes is zero, it simply
> waits until a node reports available resources to run OContainers.
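> A minimal sketch of the proposed ordering (hypothetical names, with a
> memory-only stand-in for "available resources"; the actual policy would
> compare full Resource objects): sort by queue length ascending, break ties
> by free resources descending, then pick randomly among the top K.

```java
import java.util.*;

// Sketch of a QUEUE_LENGTH_THEN_RESOURCES-style comparator (hypothetical
// names, simplified to a single memory dimension for illustration).
public class QueueLengthThenResources {
    record Node(String id, int queueLength, long availableMemMB) {}

    static Node pick(List<Node> nodes, int k, Random rng) {
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator
                .comparingInt(Node::queueLength) // shorter queues first
                .thenComparing(Comparator        // tie-break: more free memory first
                        .comparingLong(Node::availableMemMB).reversed()));
        int top = Math.min(k, sorted.size());
        return sorted.get(rng.nextInt(top)); // random node among the top K
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("n1", 2, 4096),
                new Node("n2", 0, 1024),
                new Node("n3", 0, 8192));
        // With K = 1 the choice is deterministic: n3 (queue 0, most free memory)
        Node chosen = pick(nodes, 1, new Random());
        System.out.println(chosen.id());
    }
}
```

> With K = 1 the example prints {{n3}}: both n2 and n3 have empty queues, and
> n3 wins the resource tie-break; a larger K restores the randomized spread
> used by the existing {{QUEUE_LENGTH}} policy.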
> This JIRA issue will serve as an umbrella issue that encapsulates sub-tasks
> for this work.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]