[
https://issues.apache.org/jira/browse/YARN-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146157#comment-15146157
]
Arun Suresh commented on YARN-4692:
-----------------------------------
Thank you for starting this [~vinodkv]. The document itself looks pretty
thorough and well thought out.
A couple of thoughts:
# Preemption and Reservation:
## The document (3.2.1) talks about the fact that Long Running (LR) Containers
should be started on assured capacity (not resources over fair share). I posit
that LR Containers should *primarily* be started on over-committed resources
(probably as {{OPPORTUNISTIC}} containers, see YARN-2882 and YARN-1011). The
point of LR services is that the Service as a whole should be available.
Individual container deaths/restarts should not affect the service.
## On a related note, we can give applications the ability to specify the
*Preemptability* of containers in a particular role. A low value could mean
that preemption is very costly, while a high value implies the service is
still available if some containers die. For example, when deploying HBase on
YARN, the HBase Master can have a *low* preemptability value while HBase Region
Servers can probably have a *higher* preemptability.
## Allow LR Applications to specify *peak*, *min* and *variance*/*mean* (and
maybe also transient and steady-state) values of a Resource request, so that
schedulers can make better allocation decisions. Also allow users to specify
the *min*/*max* number of containers required for a particular Service role.
This can be used as a hint for Preemption if other short running tasks are
starved.
## Currently, Schedulers create a reservation for a container on a node that
has free resources but not enough to fit the container. The document suggests
that Nodes on which LR containers are already running should not accept
reservations. I feel we should leverage
Peak/Min/Mean/Variance/transient/Steady-state resource demands to loosen this.
For example, even if a Node cannot satisfy the Peak demand, if the steady-state
demand is satisfiable, the Peak demand can probably be met by a combination of
YARN-2877 / YARN-1011 and YARN-4597 (I'll describe this below).
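To make the preemptability and steady-state/peak ideas above a bit more
concrete, here is a rough Python sketch. Everything in it ({{Role}},
{{can_place}}, {{pick_victims}}, the field names and the 0.0 to 1.0
preemptability scale) is made up for illustration and is not an existing YARN
API; it only shows one way a scheduler might consume such hints.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Role:
    """Hypothetical per-role hints an LR application might declare."""
    name: str
    preemptability: float  # assumed scale: 0.0 = very costly, 1.0 = cheap
    min_containers: int    # service is considered available at or above this
    steady_mb: int         # steady-state memory demand per container
    peak_mb: int           # peak memory demand per container


def can_place(role: Role, node_free_mb: int) -> bool:
    """Loosened placement test: only the steady-state demand has to fit.

    The gap up to peak_mb is assumed to be covered later, e.g. by
    over-commitment or node-local resizing.
    """
    return role.steady_mb <= node_free_mb


def pick_victims(running: Dict[Role, int], needed_mb: int) -> List[str]:
    """Pick containers to preempt, most preemptable role first.

    A role is never taken below its declared min_containers.
    Returns one role name per container chosen.
    """
    victims: List[str] = []
    freed = 0
    for role in sorted(running, key=lambda r: r.preemptability, reverse=True):
        spare = running[role] - role.min_containers
        while spare > 0 and freed < needed_mb:
            victims.append(role.name)
            freed += role.steady_mb
            spare -= 1
    return victims


master = Role("hbase-master", 0.1, 1, 2048, 4096)
region_server = Role("hbase-regionserver", 0.8, 2, 4096, 8192)
running = {master: 1, region_server: 5}
victims = pick_victims(running, needed_mb=8192)
```

With these example numbers, only Region Server containers are picked and the
single Master is left alone, since it is already at its declared minimum.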
# Handling Low-latency resource Spikes in LR Containers:
## In YARN-4597 [~chris.douglas] proposed 1) a new {{SCHEDULING}} container
state 2) a local *ContainerScheduler* that handles the scheduling (essentially
in charge of moving a container from the {{SCHEDULING}} to the {{RUNNING}}
state) 3) allowing the *ContainerScheduler* and *Localizer* to be directly
accessible to Containers running on the node.
## An LR container should be able to ask for more resources if required and
shed excess resources when idling. YARN-1197 tried to add support for changing
resources on an allocated container, but the design doc talks about the request
making a round trip from AM to RM and back and then to the containers.
Low-latency elasticity can probably be achieved using a combination of
YARN-2877 and leveraging the NM local ContainerScheduler.
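As a rough illustration of the node-local elasticity described above, the
Python sketch below models a scheduler that lives on the NM and answers
grow/shrink requests from its own bookkeeping, with no AM/RM round trip. The
class and method names ({{LocalContainerScheduler}}, {{grow}}, {{shrink}}) are
hypothetical and are not the interface proposed in YARN-4597.

```python
from typing import Dict


class LocalContainerScheduler:
    """Hypothetical NM-side scheduler that resizes containers locally."""

    def __init__(self, node_capacity_mb: int) -> None:
        self.capacity_mb = node_capacity_mb
        self.allocated: Dict[str, int] = {}  # container id -> MB granted

    def free_mb(self) -> int:
        return self.capacity_mb - sum(self.allocated.values())

    def start(self, container_id: str, mb: int) -> bool:
        """Admit a new container if its request fits in free memory."""
        if container_id in self.allocated or mb > self.free_mb():
            return False
        self.allocated[container_id] = mb
        return True

    def grow(self, container_id: str, extra_mb: int) -> bool:
        """Grant extra memory from node-local free capacity, if any."""
        if container_id not in self.allocated or extra_mb > self.free_mb():
            return False
        self.allocated[container_id] += extra_mb
        return True

    def shrink(self, container_id: str, release_mb: int) -> bool:
        """Give back memory; a container must keep a non-zero allocation."""
        current = self.allocated.get(container_id)
        if current is None or release_mb >= current:
            return False
        self.allocated[container_id] = current - release_mb
        return True


scheduler = LocalContainerScheduler(node_capacity_mb=8192)
scheduler.start("lr-1", 4096)
scheduler.start("lr-2", 2048)
spike_granted = scheduler.grow("lr-1", 2048)  # lr-1 absorbs a spike locally
```

A real implementation would still have to report the new sizes back to the RM
so that cluster-level accounting stays consistent; that part is left out here.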
# Queue Modeling:
## When LR Tasks are mixed with Short running Tasks, resources might always be
tied up, since LR tasks may never end. I foresee some alleviation of this by
ensuring that some % of queue capacity is always available for non-LR tasks.
Also, some more intelligent resource accounting using the Reservation system
(YARN-1051) would probably help?
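The "some % of queue capacity for non-LR tasks" idea could look roughly like
the Python check below. The function name, its parameters and the 20% default
are assumptions picked for the example, not existing scheduler configuration.

```python
def admit(kind: str, request_mb: int, queue_capacity_mb: int,
          used_lr_mb: int, used_short_mb: int,
          short_reserve_percent: int = 20) -> bool:
    """Decide whether a request fits in the queue.

    LR requests may only use (100 - short_reserve_percent)% of the queue,
    so that share always remains usable by short-running tasks.
    kind is "lr" or "short" (hypothetical labels).
    """
    free_mb = queue_capacity_mb - used_lr_mb - used_short_mb
    if request_mb > free_mb:
        return False
    if kind == "short":
        return True
    lr_cap_mb = queue_capacity_mb * (100 - short_reserve_percent) // 100
    return used_lr_mb + request_mb <= lr_cap_mb


# Queue of 10000 MB with 6000 MB already held by LR containers.
lr_ok = admit("lr", 2000, 10000, used_lr_mb=6000, used_short_mb=0)
lr_too_big = admit("lr", 2001, 10000, used_lr_mb=6000, used_short_mb=0)
```

Short-running requests are only limited by what is actually free, so they can
still borrow unused LR headroom.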
> [Umbrella] Simplified and first-class support for services in YARN
> ------------------------------------------------------------------
>
> Key: YARN-4692
> URL: https://issues.apache.org/jira/browse/YARN-4692
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
> Attachments:
> YARN-First-Class-And-Simplified-Support-For-Services-v0.pdf
>
>
> YARN-896 focused on getting the ball rolling on the support for services
> (long running applications) on YARN.
> I'd like to propose the next stage of this effort: _Simplified and
> first-class support for services in YARN_.
> The chief rationale for filing a separate new JIRA is threefold:
> - Do a fresh survey of all the things that are already implemented in the
> project
> - Weave a comprehensive story around what we further need and attempt to
> rally the community around a concrete end-goal, and
> - Additionally focus on functionality that YARN-896 and friends left for
> higher layers to take care of and see how much of that is better integrated
> into the YARN platform itself.