Karthik Kambatla commented on YARN-1011:

bq. Lets say job capacity is 1 container and the job asks for 2. Its get 1 
normal container and 1 opportunistic container. Now it releases its 1 normal 
container. At this point what happens to the opportunistic container. It is 
clearly running at lower priority on the node and as such we are not giving the 
job its guaranteed capacity. 
Momentarily, yes. The RM/NM ensemble (let us discuss that separately) realizes 
this and adjusts by promoting the opportunistic container. Is this different 
from what happens today? Today, the job is allocated one container since that 
is its capacity. Once that is done, it allocates another. Between the first one 
finishing and second one launching, we are not giving the job its guaranteed 

bq. The question is not about finding an optimal solution for this problem (and 
there may not be one). The issue here is to crisply define the semantics around 
scheduling in the design. Whatever the semantics are, we should clearly know 
what they are. IMO, the exact semantics of scheduling should be in the docs.
Agree. I ll add something to the design doc once we capture everyone's 
concerns/suggestions here on JIRA, and may be we could iterate. 

bq. Because of that complexity, I'm not 100% convinced that disfavoring 
OPPORTUNISTIC containers (e.g. low value for cpu_shares) is something that buys 
us very much. 
I don't necessarily see it as disfavoring OPPORTUNISTIC containers. Without 
over-allocation these containers wouldn't even have started. While we are 
optimizing for utilization and throughput, we are just making sure we don't 
adversely affect containers that have been launched prior with promises of 

The low value of cpu_shares only kicks in when the node is highly contended, 
and is intended to be a fail-safe. As long as there are free resources (which I 
believe is the most common case), these OPPORTUNISTIC containers should get a 
sizeable CPU share. No? 

bq. So, hopefully we can make the policy quite configurable so that the amount 
of disfavoring can be tuned for various workloads.
I agree that we might eventually need a configurable policy, but making the 
policy configurable might not be as straight-forward. I am definitely open to 
inputs on simple ways of doing this. Also, it is hard to comment on the 
effectiveness of a simple-but-not-so-configurable policy without implementing 
it and running sample workloads against it.

The simple policy I had in mind was:
# Update the SchedulerNode#getAvailable to include resources that could be 
opportunistically allocated. i.e., max(what_it_says_today, threshold * 
resource). It should be easy to support per-resource thresholds here. 
# At allocate time, label an allocation OPPORTUNISTIC if it takes the 
cumulative allocation over the advertised capacity.
# When space frees up on nodes, NMs send candidate containers for promotion on 
the heartbeat. The RM consults a policy to come up with a list of yes/no 
decisions for each of these candidates. Initially, I would like for the default 
to be yes without any reconsiderations. This favors continuing the execution of 
a container over preempting them. 

Based on what we see, we could tweak this simple policy or come up with more 
sophisticated policies. 

> [Umbrella] Schedule containers based on utilization of currently allocated 
> containers
> -------------------------------------------------------------------------------------
>                 Key: YARN-1011
>                 URL: https://issues.apache.org/jira/browse/YARN-1011
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>         Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf
> Currently RM allocates containers and assumes resources allocated are 
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated 
> containers and, if appropriate, allocate more (speculative?) containers.

This message was sent by Atlassian JIRA

Reply via email to