[ 
https://issues.apache.org/jira/browse/YARN-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146157#comment-15146157
 ] 

Arun Suresh commented on YARN-4692:
-----------------------------------

Thank you for starting this, [~vinodkv]. The document itself looks pretty
thorough and well thought out.

A couple of thoughts:

# Preemption and Reservation:
## The document (3.2.1) says that Long Running (LR) containers should be started
on assured capacity (not on resources over fair share). I posit that LR
containers should *primarily* be started on over-committed resources (probably
as {{OPPORTUNISTIC}} containers, see YARN-2882 and YARN-1011). The point of LR
services is that the service as a whole should be available. Individual
container deaths/restarts should not affect the service.
## On a related note, we can give applications the ability to specify the
*preemptability* of containers in a particular role. A low value could mean that
preemption is very costly, while a high value implies the service is still
available if some containers die. For example, when deploying HBase on YARN, the
HBase Master can have a *low* preemptability value while HBase Region Servers
can probably have a *higher* one.
## Allow LR applications to specify the *peak*, *min* and *variance*/*mean*
(and possibly the *transient* and *steady-state* values) of a resource request
so that schedulers can make better allocation decisions. Also allow users to
specify the *min*/*max* number of containers required for a particular service
role. This can be used as a hint for preemption if other short-running tasks
are starved.
## Currently, schedulers create a reservation for a container on a node that has
free resources but not enough to fit the container. The document suggests that
nodes on which LR containers are already running should not accept
reservations. I feel we should leverage the
peak/min/mean/variance/transient/steady-state resource demands to loosen this.
For example, even if a node cannot satisfy the peak demand, if the steady-state
demand is satisfiable, the peak demand can probably be met by a combination of
YARN-2877 / YARN-1011 and YARN-4597 (I'll describe this below).
# Handling Low-latency resource Spikes in LR Containers:
## In YARN-4597, [~chris.douglas] proposed 1) a new {{SCHEDULING}} container
state, 2) a local *ContainerScheduler* that handles the scheduling (essentially
in charge of moving a container from the {{SCHEDULING}} to the {{RUNNING}}
state), and 3) allowing the *ContainerScheduler* and *Localizer* to be directly
accessible to containers running on the node.
## An LR container should be able to ask for more resources when required and
shed excess resources when idling. YARN-1197 tried to add support for changing
resources on an allocated container, but the design doc describes the request
making a round trip from the AM to the RM and back, and then to the containers.
Low-latency elasticity can probably be achieved using a combination of
YARN-2877 and the NM-local ContainerScheduler.
# Queue Modeling:
## When LR tasks are mixed with short-running tasks, resources might always be
tied up, since LR tasks may never end. I foresee some alleviation of this by
ensuring that some % of queue capacity is always available for non-LR tasks.
Also, some more intelligent resource accounting using the Reservation system
(YARN-1051) would probably help?
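
To make the preemptability and steady-state ideas above a bit more concrete,
here is a minimal sketch in Python. It is only an illustration: every name in
it ({{ResourceProfile}}, {{Role}}, {{preemption_order}}, {{can_place}}, the
0.0 to 1.0 preemptability scale and the over-commit headroom parameter) is
hypothetical and not part of any existing YARN API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    """Hypothetical per-role resource hint, in MB of memory."""
    minimum: int
    steady_state: int
    peak: int


@dataclass(frozen=True)
class Role:
    """Hypothetical description of one role of an LR service."""
    name: str
    preemptability: float  # assumed scale: 0.0 = very costly, 1.0 = cheap
    profile: ResourceProfile
    min_containers: int    # service is considered degraded below this count
    running: int           # containers currently running for this role


def preemption_order(roles):
    """Return roles that may lose a container, most preemptable first.

    A role is a candidate only while it runs more than min_containers.
    """
    candidates = [r for r in roles if r.running > r.min_containers]
    return sorted(candidates, key=lambda r: r.preemptability, reverse=True)


def can_place(node_free_mb, profile, overcommit_headroom_mb=0):
    """Loosened placement check (assumption, not the current scheduler logic).

    Accept the node if the steady-state demand fits in free capacity and the
    peak fits once over-committed headroom is also counted.
    """
    return (profile.steady_state <= node_free_mb
            and profile.peak <= node_free_mb + overcommit_headroom_mb)


if __name__ == "__main__":
    master = Role("hbase-master", 0.1, ResourceProfile(1024, 2048, 4096), 1, 1)
    region = Role("hbase-regionserver", 0.8,
                  ResourceProfile(2048, 4096, 8192), 3, 5)
    print([r.name for r in preemption_order([master, region])])
    print(can_place(5000, region.profile))
    print(can_place(5000, region.profile, overcommit_headroom_mb=4000))
```

In this sketch the HBase Master is never offered for preemption because it is
already at its minimum container count, and a Region Server fits on a node
with 5000 MB free only when enough over-committed headroom exists to absorb
its peak.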





> [Umbrella] Simplified and first-class support for services in YARN
> ------------------------------------------------------------------
>
>                 Key: YARN-4692
>                 URL: https://issues.apache.org/jira/browse/YARN-4692
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: 
> YARN-First-Class-And-Simplified-Support-For-Services-v0.pdf
>
>
> YARN-896 focused on getting the ball rolling on the support for services 
> (long running applications) on YARN.
> I’d like to propose the next stage of this effort: _Simplified and first-class 
> support for services in YARN_.
> The chief rationale for filing a separate new JIRA is threefold:
>  - Do a fresh survey of all the things that are already implemented in the 
> project
>  - Weave a comprehensive story around what we further need and attempt to 
> rally the community around a concrete end-goal, and
>  - Additionally focus on functionality that YARN-896 and friends left for 
> higher layers to take care of and see how much of that is better integrated 
> into the YARN platform itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
