Bikas Saha commented on YARN-1011:

Good points, but let me play the devil's advocate to get some more clarity :)
bq. As soon as we realize the perf is slower because the node has higher usage 
than we had anticipated, we preempt the container and retry allocation 
(guaranteed or opportunistic depending on the new cluster state). So, it 
shouldn't run slower for longer than our monitoring interval. Is this 
assumption okay?
How do we determine that the perf is slower? CPU utilization would never exceed 100%, 
even under over-allocation. And is preempting always necessary? If we are sure that 
the OS is going to starve the opportunistic containers, then can we assume that 
when the node is fully utilized, only our guaranteed containers are using 
resources? In that case we could leave the opportunistic containers in place, so 
that they start soaking up excess capacity again once the normal containers have 
stopped spiking. Perhaps some experiments will shed some light on this.
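To make the detection question concrete, here is a toy sketch (class and method names are hypothetical, not actual YARN APIs) of the only node-local signal available: utilization pinned near saturation while opportunistic containers are still running.

```java
// Hypothetical sketch, not actual YARN code: a per-node monitor deciding
// whether opportunistic containers are likely being starved by the OS.
public class ContentionCheck {
    // Fraction of node CPU capacity above which we treat the node as
    // saturated; an assumed tunable, not a real YARN config key.
    static final double SATURATION_THRESHOLD = 0.95;

    /**
     * Since utilization alone can never exceed 100%, the closest signal
     * the node itself can give is: CPU pinned near capacity while
     * opportunistic containers are present and presumably getting starved.
     */
    static boolean opportunisticLikelyStarved(double cpuUtilization,
                                              int runningOpportunistic) {
        return cpuUtilization >= SATURATION_THRESHOLD
                && runningOpportunistic > 0;
    }

    public static void main(String[] args) {
        // Node pinned at 98% CPU with 2 opportunistic containers running.
        System.out.println(opportunisticLikelyStarved(0.98, 2)); // true
        // Node at 60% CPU: opportunistic containers are getting cycles.
        System.out.println(opportunisticLikelyStarved(0.60, 2)); // false
    }
}
```

Note the sketch cannot distinguish "guaranteed containers spiking" from "opportunistic containers soaking up slack", which is exactly why leaving the opportunistic containers alone and measuring may be the better policy.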

bq. The opportunistic container will continue to run on this node so long as it 
is getting the resources it needs. If there is any sort of resource contention, 
it is preempted and is up for allocation on one of the free nodes.
Let's say a job's capacity is 1 container and the job asks for 2. It gets 1 normal 
container and 1 opportunistic container. Now it releases its 1 normal 
container. At this point, what happens to the opportunistic container? It is 
clearly running at lower priority on the node, and as such we are not giving the 
job its guaranteed capacity. The question is not about finding an optimal 
solution to this problem (and there may not be one). The issue here is to 
crisply define the semantics around scheduling in the design. Whatever the 
semantics are, we should clearly know what they are. IMO, the exact semantics 
of scheduling should be in the docs.
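The capacity math in the scenario above can be sketched as follows (names are illustrative, not actual YARN code): once the job releases normal containers, some of its running opportunistic containers now represent capacity the job is entitled to.

```java
// Hypothetical sketch, not actual YARN code: how many of a job's running
// opportunistic containers are "owed" guaranteed status after releases.
public class OwedPromotion {
    /**
     * Freed guaranteed headroom that running opportunistic containers
     * could legitimately claim, capped at the number actually running.
     */
    static int owedPromotions(int guaranteedCapacity,
                              int runningGuaranteed,
                              int runningOpportunistic) {
        int headroom = guaranteedCapacity - runningGuaranteed;
        return Math.min(Math.max(headroom, 0), runningOpportunistic);
    }

    public static void main(String[] args) {
        // The example from the discussion: capacity 1, the job released
        // its 1 normal container, 1 opportunistic container still running.
        System.out.println(owedPromotions(1, 0, 1)); // 1
    }
}
```

The open semantic question is what to do with that count: promote the opportunistic container in place, or preempt it and reallocate a guaranteed one elsewhere. Either choice is defensible; the design doc just needs to pick one and state it.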

bq. The RM schedules the next highest priority "task" for which it couldn't 
find a guaranteed container as an opportunistic container. This task continues 
to run as long as it is not getting enough resources. If there is no resource 
contention, the task continues to run. If guaranteed resources free up on the 
node it is running on, isn't it fair to promote the container to Guaranteed?
Sure. And that's why the system should upgrade opportunistic containers in the 
order in which they were allocated. However, the decision must be made at the 
RM and not the NM, since the NMs don't know about total capacity, and multiple NMs 
locally upgrading their opportunistic containers might end up over-allocating 
for a job. Further, the queue-sharing state may have changed since the 
opportunistic allocation, and hence assuming that the opportunistic container 
"would have" gotten that allocation anyway, at a later point in time, may not 
be valid.
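The RM-side, allocation-order promotion argued for here could be sketched like this (class and method names are hypothetical, not YARN's): the RM records opportunistic allocations FIFO and promotes from the head, re-validating each candidate against the headroom available now rather than trusting the original admission decision.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch, not actual YARN code: RM-side promotion of
// opportunistic containers in the order they were allocated.
public class PromotionQueue {
    private final Queue<String> opportunisticInOrder = new ArrayDeque<>();

    void recordOpportunisticAllocation(String containerId) {
        opportunisticInOrder.add(containerId);
    }

    // Promote oldest-first, but bound by the guaranteed headroom available
    // *now* (simplified here to a container count; the real check would
    // consult the scheduler's current queue-sharing state).
    List<String> promote(int guaranteedHeadroom) {
        List<String> promoted = new ArrayList<>();
        while (guaranteedHeadroom > 0 && !opportunisticInOrder.isEmpty()) {
            promoted.add(opportunisticInOrder.poll());
            guaranteedHeadroom--;
        }
        return promoted;
    }
}
```

Because the queue and the headroom check both live at the RM, no two NMs can independently promote their local opportunistic containers and jointly exceed the job's capacity.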

In summary, what we need in the document is a clear definition of the 
scheduling policy around this - whatever that policy may be.

> [Umbrella] Schedule containers based on utilization of currently allocated 
> containers
> -------------------------------------------------------------------------------------
>                 Key: YARN-1011
>                 URL: https://issues.apache.org/jira/browse/YARN-1011
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>         Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf
> Currently RM allocates containers and assumes resources allocated are 
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated 
> containers and, if appropriate, allocate more (speculative?) containers.

This message was sent by Atlassian JIRA