[ https://issues.apache.org/jira/browse/YARN-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087500#comment-16087500 ]

Arun Suresh edited comment on YARN-6808 at 7/14/17 3:50 PM:
------------------------------------------------------------

[~leftnoteasy], good questions..

bq. Use opportunistic container to do lazy preemption in NM. (Is there any 
umbrella JIRA for this?)
Technically, this is the default behavior for opportunistic containers as it stands today: Opp containers are killed in the NM when a Guaranteed container is started by an AM, if the NM at that point does not have the resources to start the Guaranteed container. We are also working on YARN-5972, which adds some amount of work preservation: instead of killing the Opp container, we PAUSE it. PAUSE will be supported using the cgroups [freezer|https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt] module on Linux and JobObjects on Windows (we are actually using this in production).
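Just to make the PAUSE mechanism concrete, here is a minimal sketch of what driving the cgroups freezer looks like; this is illustrative only, not the YARN-5972 code, and the cgroup path is a made-up placeholder:
{code:java}
// Illustrative sketch only - NOT the YARN-5972 implementation.
// Pauses/resumes all tasks in a container's cgroup via the freezer subsystem.
// The cgroup mount point and container path below are hypothetical.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FreezerSketch {
  private final Path stateFile;

  public FreezerSketch(String containerCgroupPath) {
    // e.g. /sys/fs/cgroup/freezer/hadoop-yarn/container_0001 (hypothetical)
    this.stateFile = Paths.get(containerCgroupPath, "freezer.state");
  }

  /** Freeze (pause) every task in the container's cgroup. */
  public void pause() throws IOException {
    Files.write(stateFile, "FROZEN".getBytes(StandardCharsets.UTF_8));
  }

  /** Thaw (resume) every task in the container's cgroup. */
  public void resume() throws IOException {
    Files.write(stateFile, "THAWED".getBytes(StandardCharsets.UTF_8));
  }
}
{code}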

bq. Let's say app1 in an underutilized queue, which want to preempt containers 
from an over-utilized queue. Will preemption happens if app1 asks opportunistic 
container?
I am assuming that by under-utilized you mean starved. Currently, if app1 SPECIFICALLY asks for Opp containers, it will get them irrespective of whether its queue is under-utilized or not. Opp container ALLOCATION today is not limited by queue/cluster capacity; it is limited only by the length of the container queue on each Node (YARN-1011 will in time place stricter capacity limits by allocating only if the already-allocated resources are not actually being used). Opp container EXECUTION is obviously bound by the available resources on the NM, and as I mentioned earlier, running Opp containers will be killed to make room for any Guaranteed container.
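For reference, asking for Opp containers explicitly boils down to setting the {{ExecutionTypeRequest}} on the {{ResourceRequest}}; a rough sketch (memory, vcores, priority and container count below are just placeholder values):
{code:java}
// Sketch: an AM explicitly asking for OPPORTUNISTIC containers.
// The resource sizes, priority and container count are placeholders.
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class OppRequestSketch {
  public static ResourceRequest newOppRequest() {
    ResourceRequest req = ResourceRequest.newInstance(
        Priority.newInstance(1), ResourceRequest.ANY,
        Resource.newInstance(1024, 1), 4);
    // OPPORTUNISTIC with enforce=false: the scheduler is still free to
    // satisfy the request with a Guaranteed container if it chooses to.
    req.setExecutionTypeRequest(
        ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, false));
    return req;
  }
}
{code}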

bq. For target #1, who make the decision of moving guaranteed containers to 
opportunistic containers. If it is still decided by central RM, does that mean 
preemption logics in RM are same as today except kill operation is decided by 
NM side? 
Yes, it is the RM. Currently, in both Schedulers, after a container is allocated, candidates for preemption are chosen from the containers of apps in queues that are above capacity; the RM then asks the NM to preempt those containers. What the latest patch (002) here does is: allocation of containers happens in the same code path, but right before handing the container to the AM, it checks whether the queue capacity is exceeded; if so, it downgrades the container to Opp. Thus, technically, the same apps/containers that were targets for normal preemption become candidates for preemption at the NM. There are obviously improvements that can be made, like the one I mentioned in phase 2 of the JIRA description: in addition to downgrading over-capacity containers to Opp, we can upgrade running Opp containers to Guaranteed when some of an app's Guaranteed containers complete.
As I mentioned, we are still prototyping; we are running tests now to collect data and will keep you posted on the results.
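Very roughly, the check that patch 002 adds looks something like the sketch below; the {{QueueView}} and {{Allocation}} types here are simplified stand-ins for the real scheduler classes, not the actual patch code:
{code:java}
// Pseudocode-style sketch of the patch-002 idea; NOT the real scheduler code.
// QueueView and Allocation are simplified stand-ins for the CS/FS internals.
import org.apache.hadoop.yarn.api.records.ExecutionType;

public class DowngradeSketch {

  /** Stand-in for the scheduler's view of a queue's capacity usage. */
  interface QueueView {
    float getUsedCapacity();       // e.g. 1.2f means 120% of configured capacity
    float getConfiguredCapacity(); // normalized, typically 1.0f
  }

  /** Stand-in for the container allocation about to be handed to the AM. */
  static class Allocation {
    ExecutionType executionType = ExecutionType.GUARANTEED;
  }

  /** Called right before the allocation is returned to the AM. */
  static Allocation maybeDowngrade(Allocation allocated, QueueView queue) {
    if (queue.getUsedCapacity() > queue.getConfiguredCapacity()) {
      // Queue is over its configured capacity: hand the container out as
      // OPPORTUNISTIC, so it becomes a preemption candidate at the NM
      // instead of being killed later by the RM's preemption logic.
      allocated.executionType = ExecutionType.OPPORTUNISTIC;
    }
    return allocated;
  }
}
{code}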

bq. For overall opportunistic container execution: If OC launch request will be 
queued by NM, it may wait a long time before get executed. In this case, do we 
need to modify AM code to: a. expect longer delay before think the launch 
fails. b. asks more resource on different hosts since there's no guaranteed 
launch time for OC?
With YARN-4597, we introduced a container state called SCHEDULED. A container is in the SCHEDULED state while it is localizing or while it is waiting in the queue. Essentially, the extra delay looks just like localization delay to the AM. We have verified that this is fine for MapReduce and Spark.

bq. What happens if an app doesn't want to ask opportunistic container when go 
beyond headroom? (Such as online services). I think this should be a per-app 
config (give me OC when I'm go beyond headroom).
A per-app config makes sense. However, the ResourceRequest already has a field called {{ExecutionTypeRequest}}, which in addition to the {{ExecutionType}} also carries an {{enforceExecutionType}} flag. By default it is false, but if it is set to true, my latest patch ensures that only Guaranteed containers are returned. I have added a test case to verify that as well.
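So an app that never wants to be downgraded (e.g. an online service) would do something like the following; again just a sketch, with placeholder resource values:
{code:java}
// Sketch: a latency-sensitive app insisting on GUARANTEED containers only.
// With enforceExecutionType = true, patch 002 will not downgrade this request.
// Resource sizes and priority are placeholders.
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class GuaranteedOnlySketch {
  public static ResourceRequest newStrictGuaranteedRequest() {
    ResourceRequest req = ResourceRequest.newInstance(
        Priority.newInstance(0), ResourceRequest.ANY,
        Resource.newInstance(2048, 2), 1);
    req.setExecutionTypeRequest(
        ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED, true));
    return req;
  }
}
{code}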

bq. Existing patch makes static decision, which happens when new resource 
request added by AM. Should this be reconsidered when app's headroom changed 
over time?
My latest patch (002) partly addresses this. The decision is now made after container allocation, and I am now ignoring the headroom: I downgrade only if, at the time of container allocation, the queue capacity is exceeded. The existing code paths already ensure that the max-capacity of queues is never exceeded anyway.




> Allow Schedulers to return OPPORTUNISTIC containers when queues go over 
> configured capacity
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-6808
>                 URL: https://issues.apache.org/jira/browse/YARN-6808
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-6808.001.patch, YARN-6808.002.patch
>
>
> This is based on discussions with [~kasha] and [~kkaranasos].
> Currently, when a Queue goes over capacity, apps on starved queues must wait
> either for containers to complete or for them to be pre-empted by the 
> scheduler to get resources.
> This JIRA proposes to allow Schedulers to:
> # Allocate all containers over the configured queue capacity/weight as 
> OPPORTUNISTIC.
> # Auto-promote running OPPORTUNISTIC containers of apps as and when their 
> GUARANTEED containers complete.


