[
https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630769#comment-13630769
]
Bikas Saha commented on YARN-45:
--------------------------------
I like the idea of the RM giving information to the AM about actions that it
might take which will affect the AM. However, I am wary of having the action
taken in different places, e.g. the KILL to the containers should come from the
RM or the AM exclusively but not from both. Otherwise we open ourselves up to
race conditions, unnecessary kills and complex logic in the RM.
Preemption is something that, IMO, the RM needs to do at the very last moment,
when there is no other way for resources to be freed up. If we decide to
preempt at time T1 and then actually preempt at time T2 then the cluster
conditions may have changed between T1 and T2 which may invalidate the
decisions taken at T1. New resources may have freed up that reduce the number
of containers to be killed. This sub-optimality is directly proportional to
the length of time between T1 and T2. So ideally we want to keep T1=T2. One can
argue that things can also change after the preemption in ways that would have
made the preemption unnecessary, so the above argument for T1=T2 is fallacious.
However, preemption policies are usually based on deadlines, such as "the
allocation of queue1 must be met within X seconds". So the RM does not have the
luxury of waiting for X+1 seconds. The best it can do is to wait up to X seconds
in the hope that things will work out and, at X, redistribute resources to meet
the deficit.
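
To make the T1 vs. T2 point concrete, here is a minimal sketch in Python. It
is only an illustration: QueueState, deficit_at and the numbers are made up
for this example and are not part of the YARN scheduler. It just shows that the
same deficit calculation yields fewer kills when it is evaluated at the deadline
rather than earlier.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical view of one queue; not a YARN class."""
    guaranteed: int  # containers the queue is entitled to
    allocated: int   # containers the queue currently holds


def deficit_at(queue, free_containers):
    """Containers that would still have to be preempted from other queues.

    Free containers are handed to the starved queue first; only the
    remaining shortfall has to be covered by preemption.
    """
    shortfall = max(0, queue.guaranteed - queue.allocated)
    return max(0, shortfall - free_containers)


queue1 = QueueState(guaranteed=10, allocated=6)

# Decision taken early (T1): nothing is free yet, so 4 kills are planned.
early = deficit_at(queue1, free_containers=0)   # 4

# Decision taken at the deadline (T2 = X): 3 containers freed up on their own.
late = deficit_at(queue1, free_containers=3)    # 1
```

Acting on the early number would have killed 3 containers that did not need to
die, which is the sub-optimality described above.
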
At the same time, I can see that there is an argument that the AM knows best
how to free up its resources. It will be good to remember that the AM has
already informed the RM about the importance of all its containers when it made
the requests at different priorities. So the RM knows the order of importance
of the containers and the RM also knows the amount of time each container has
been allocated. Assuming container runtime as a proxy for container work done,
this data can be used by the RM to preempt in a work preserving manner without
having to talk to the AM.
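
A rough sketch of how such an RM-side choice could look, in Python. The
Container record and pick_victims are hypothetical names for this example,
and the ordering rule is an assumption: a smaller priority number means a more
important container, and runtime stands in for work done, as suggested above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical record of what the RM already knows about a container."""
    container_id: str
    priority: int        # assumption: smaller number = more important
    runtime_secs: float  # proxy for work done so far


def pick_victims(containers, count):
    """Pick `count` containers whose loss should cost the least work.

    Least important containers go first; among equally important ones,
    the container that has run for the shortest time goes first.
    """
    ordered = sorted(containers, key=lambda c: (-c.priority, c.runtime_secs))
    return ordered[:count]


running = [
    Container("c1", priority=1, runtime_secs=500.0),
    Container("c2", priority=5, runtime_secs=300.0),
    Container("c3", priority=5, runtime_secs=20.0),
    Container("c4", priority=3, runtime_secs=10.0),
]

victims = [c.container_id for c in pick_victims(running, 2)]  # ['c3', 'c2']
```

Note that c4 has the shortest runtime but is spared because it was requested
at a more important priority than c2 and c3.
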
Notifying the AM is useful because it allows the AM to take actions that
preserve work, such as checkpointing. However, IMO, the AM should only do
checkpointing operations but not kill the containers. That should still happen
at the RM as the very last option at the last moment. If the situation changes
in the grace period and the containers do not need to be killed then there is
no point in the AM killing them right now. Since the containers keep running,
this also lets us lengthen the grace period, which helps because checkpointing
and preserving work usually means persisting data in a stable store and may be
slow in practical scenarios.
To summarize, I would propose an API in which the RM tells the AM about exactly
which containers it might imminently preempt with the contract being that the
AM could take actions to preserve the work done in those containers. The AM can
continue to run those containers until the RM actually preempts them if needed.
If we really think that the choice of containers needs to be made at the AM
then the AM needs to checkpoint those containers and inform the RM about the
containers it has chosen. But the final decision to kill must be made, and the
kill sent, by the RM.
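
The proposed contract could be sketched roughly as follows, in Python. All
class and method names here (ApplicationMaster, ResourceManager,
on_preemption_warning, warn, enforce) are invented for illustration and do not
correspond to any existing YARN interface. The point is only the division of
labour: the AM checkpoints on a warning, and the RM alone removes containers,
and only as many as are still needed when the grace period ends.

```python
class ApplicationMaster:
    """Hypothetical AM: reacts to warnings by preserving work only."""

    def __init__(self):
        self.checkpointed = set()

    def on_preemption_warning(self, container_ids):
        # Checkpoint the named containers; never kill them here.
        self.checkpointed.update(container_ids)


class ResourceManager:
    """Hypothetical RM: the only party that ever kills a container."""

    def __init__(self, am, running):
        self.am = am
        self.running = set(running)
        self.candidates = []

    def warn(self, container_ids):
        """Tell the AM exactly which containers may be preempted soon."""
        self.candidates = [c for c in container_ids if c in self.running]
        self.am.on_preemption_warning(list(self.candidates))

    def enforce(self, still_needed):
        """At the end of the grace period, kill only what is still needed."""
        victims = self.candidates[:still_needed]
        for container_id in victims:
            self.running.discard(container_id)
        self.candidates = []
        return victims


am = ApplicationMaster()
rm = ResourceManager(am, running=["c1", "c2", "c3"])

rm.warn(["c1", "c2"])                 # AM checkpoints c1 and c2, both keep running
killed = rm.enforce(still_needed=1)   # resources freed up meanwhile: one kill
```

Here c2 was checkpointed but never killed, so it simply continues to run.
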
> Scheduler feedback to AM to release containers
> ----------------------------------------------
>
> Key: YARN-45
> URL: https://issues.apache.org/jira/browse/YARN-45
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Chris Douglas
> Assignee: Carlo Curino
> Attachments: YARN-45.patch, YARN-45.patch
>
>
> The ResourceManager strikes a balance between cluster utilization and strict
> enforcement of resource invariants in the cluster. Individual allocations of
> containers must be reclaimed, or reserved, to restore the global invariants
> when cluster load shifts. In some cases, the ApplicationMaster can respond to
> fluctuations in resource availability without losing the work already
> completed by that task (MAPREDUCE-4584). Supplying it with this information
> would be helpful for overall cluster utilization [1]. To this end, we want to
> establish a protocol for the RM to ask the AM to release containers.
> [1] http://research.yahoo.com/files/yl-2012-003.pdf