[
https://issues.apache.org/jira/browse/YARN-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649126#comment-13649126
]
Carlo Curino commented on YARN-568:
-----------------------------------
Sandy, I agree with your summary of the FS mechanics, and you raise important
questions that I try to address below.
The idea behind the preemption we are introducing is to preempt first and kill
later, so the AM can "save" its work before the kill (in the CS we go a step
further and let the AM pick the containers, but that is a bit trickier, so I
would leave it out for the time being). This requires us to be "consistent" in
how we pick containers: first ask nicely, then kill the same containers if the
AM ignores us or is too slow. This is needed to give the AM a consistent view
of the RM's needs. Assuming we are consistent in picking containers, I think
the simple mechanics we posted should be ok.
Now how can we get there:
1) This translates into a deterministic choice of containers across invocations
of the preemption procedure. Sorting by priority is a first step in that
direction (although, as I commented [here |
https://issues.apache.org/jira/browse/YARN-569?focusedCommentId=13638825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13638825],
there are some other issues with that). Adding reverse-container-ordering
might help guarantee that the picking order is consistent (missing now). In
particular, if the need for preemption is consistent over time, no new
containers would be granted to this app, so picking from the "tail" should
yield a consistent set of containers (minus the ones naturally expiring, which
would be accounted for in future runs as a reduced preemption need). On the
other hand, if the cluster conditions change drastically enough (e.g., a big
job finishes) and there is no more need to kill containers from this app, we
save the cost of killing and rescheduling. In a sense, instead of looking at an
instantaneous need for preemption every 15 seconds, we check every 5 seconds
and only kill when there is a sustained need over a window of
>maxWaitTimeBeforeKill. I think that if we can get this to work as intended we
will get a better overall policy (less jitter).
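To make the "consistent tail" picking concrete, here is a minimal sketch of a
deterministic victim ordering (class and field names are illustrative, not the
actual FS code; it also assumes a lower priority value means a more important
container): sort by priority first, then by reverse allocation order, so
repeated invocations of the policy keep selecting the same containers.

{code:java}
import java.util.Comparator;
import java.util.List;

class PreemptionOrdering {
  // Hypothetical per-container record; not the actual YARN classes.
  static class ContainerInfo {
    final long allocationId; // monotonically increasing allocation order
    final int priority;      // assumption: lower value = more important

    ContainerInfo(long allocationId, int priority) {
      this.allocationId = allocationId;
      this.priority = priority;
    }
  }

  // Least-important, most-recently-allocated containers come first, so
  // the head of the sorted list is a stable set of victims across runs.
  static final Comparator<ContainerInfo> VICTIM_ORDER =
      Comparator.<ContainerInfo>comparingInt(c -> c.priority).reversed()
          .thenComparing(Comparator.<ContainerInfo>comparingLong(
              c -> c.allocationId).reversed());

  static void sortVictims(List<ContainerInfo> running) {
    running.sort(VICTIM_ORDER); // deterministic for a fixed set of containers
  }
}
{code}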
2) toPreempt is decremented in all three cases because we would otherwise
double-kill for the same resource need: imagine you want 5 containers back and
send the corresponding preemption requests; while the AMs are working on
preemption, the preemption procedure is invoked again and re-detects that we
want 5 containers back. If you don't account for the pending requests (i.e.,
decrement toPreempt for those too), you would pick (preempt or kill) another 5
containers (and depending on the time constants this could happen more than
twice)... now we are forcing the AM to release 10 (or more) containers for a
5-container preemption need. Anyway, I agree that once we converge on this we
should comment it clearly in the code; this seems the kind of code that people
would try to "fix" :-). The shift you spotted with this comment is from running
"rarely enough" that all the actions initiated during a previous run are fully
reflected in the current cluster state, to running frequently enough that the
actions we are taking might not be visible yet. This forces us to do some more
bookkeeping and have robust heuristics, but I think it is worth the improvement
in the scheduler behavior.
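Here is a minimal sketch of that bookkeeping (method and map names are
hypothetical, and the real patch's structure may differ): toPreempt is
decremented for containers we just killed, for containers with a still-pending
preemption request, and for newly warned ones, so a pass 5 seconds later does
not select a fresh set of victims for the same need.

{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class PreemptionAccounting {
  // containerId -> timestamp of the preemption request we sent to the AM
  private final Map<Long, Long> warnedContainers = new HashMap<>();
  private final long maxWaitTimeBeforeKillMs = 10_000L;

  // Called every preemptionInterval with the instantaneous need.
  void preemptResources(long now, int toPreempt, Iterable<Long> candidates) {
    // Kill containers warned more than maxWaitTimeBeforeKill ago; either
    // way (killed or still pending), count them against toPreempt so we
    // never pick new victims for a need that is already being served.
    for (Iterator<Map.Entry<Long, Long>> it =
             warnedContainers.entrySet().iterator(); it.hasNext(); ) {
      Map.Entry<Long, Long> e = it.next();
      if (now - e.getValue() >= maxWaitTimeBeforeKillMs) {
        kill(e.getKey());
        it.remove();
      }
      toPreempt--;
    }
    // Only warn *new* containers for whatever need remains.
    for (Long c : candidates) {
      if (toPreempt <= 0) break;
      if (warnedContainers.containsKey(c)) continue;
      sendPreemptionRequest(c);
      warnedContainers.put(c, now);
      toPreempt--;
    }
  }

  private void kill(long containerId) { /* issue the kill via the RM */ }
  private void sendPreemptionRequest(long containerId) { /* notify the AM */ }
}
{code}

Without the warnedContainers check in the second loop, each 5-second pass
would warn (and eventually kill) a fresh set of containers for the same
5-container need, which is exactly the double-kill described above.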
3) It is probably good to have a "no-preemption" mode in which we simply kill
outright. However, by setting the time constants right (e.g., a
preemptionInterval of 5 seconds and a maxWaitTimeBeforeKill of 10 seconds) you
would get the same effect as a hard kill at most 15 seconds after the need for
preemption arises, yet every preemption-aware AM gets the chance to save the
progress made so far. In our current MR implementation of preemption, you
might get containers back even faster, as we release containers as soon as we
are done checkpointing. Note that since we are not actually killing at every
preemptionInterval, we could set it very low (if the performance of the FS
allows it) and get more points of observation and faster reaction times, while
maxWaitTimeBeforeKill would be tuned as a tradeoff between giving the AM
enough time to preempt and the speed of rebalancing.
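The worst-case timing can be spelled out in a few lines (constant names are
illustrative, not the actual configuration keys): a need that arises just
after a check is detected up to one interval late, and the kill happens after
the grace period on top of that.

{code:java}
class PreemptionTiming {
  static final long PREEMPTION_INTERVAL_MS = 5_000;   // how often we check
  static final long MAX_WAIT_BEFORE_KILL_MS = 10_000; // AM grace period

  // Worst case: detect up to one interval late, then wait out the grace
  // period before the hard kill; 5s + 10s = 15s with the values above.
  static long worstCaseKillDelayMs() {
    return PREEMPTION_INTERVAL_MS + MAX_WAIT_BEFORE_KILL_MS;
  }
}
{code}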
I will look into adding the allocation order as a second-level ordering for
containers. Please let me know whether this seems enough or whether I am
missing something.
> FairScheduler: support for work-preserving preemption
> ------------------------------------------------------
>
> Key: YARN-568
> URL: https://issues.apache.org/jira/browse/YARN-568
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: scheduler
> Reporter: Carlo Curino
> Assignee: Carlo Curino
> Attachments: YARN-568.patch, YARN-568.patch
>
>
> In the attached patch, we modified the FairScheduler to substitute its
> preemption-by-killing with a work-preserving version of preemption (followed
> by killing if the AMs do not respond quickly enough). This should allow us to
> run preemption checking more often, but kill less often (proper tuning to be
> investigated). Depends on YARN-567 and YARN-45; related to YARN-569.