[ https://issues.apache.org/jira/browse/YARN-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649126#comment-13649126 ]

Carlo Curino commented on YARN-568:
-----------------------------------

Sandy, I agree with your summary of the FS mechanics, and you raise important 
questions that I try to address below. 

The idea behind the preemption we are introducing is to preempt first and kill 
later, so the AM gets a chance to "save" its work before the kill (in the CS we 
go a step further and let the AM pick the containers, but that is a bit 
trickier, so I would leave it out for the time being). This requires us to be 
consistent in how we pick the containers: first ask nicely, and then kill the 
same containers if the AM is ignoring us or is too slow. This is needed to give 
the AM a consistent view of the RM's needs. Assuming we are consistent in 
picking containers, I think the simple mechanics we posted should be ok. 
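
To make the two-phase flow concrete, here is a minimal, self-contained sketch 
of the idea (warnApplication, killContainer, and the victim list are 
hypothetical illustrations, not the actual methods in the patch):

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of preempt-first, kill-later; not the patch's code. */
class PreemptThenKill {
  static final long MAX_WAIT_BEFORE_KILL_MS = 10_000;
  // containerId -> time we first asked the AM to preempt it
  final Map<String, Long> warnedAtMs = new HashMap<>();

  void preemptionRound(List<String> victims, long nowMs) {
    // victims must be chosen deterministically across rounds (see point 1)
    for (String containerId : victims) {
      Long warnedAt = warnedAtMs.get(containerId);
      if (warnedAt == null) {
        warnApplication(containerId);   // first ask nicely
        warnedAtMs.put(containerId, nowMs);
      } else if (nowMs - warnedAt > MAX_WAIT_BEFORE_KILL_MS) {
        killContainer(containerId);     // AM ignored us or was too slow
        warnedAtMs.remove(containerId);
      }
    }
  }

  void warnApplication(String containerId) { /* send preemption request to the AM */ }
  void killContainer(String containerId)   { /* hard-kill the container */ }
}
{code}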

Now how can we get there:

1) This translates into a deterministic choice of containers across invocations 
of the preemption procedure. Sorting by priority is a first step in that 
direction (although, as I commented [here | 
https://issues.apache.org/jira/browse/YARN-569?focusedCommentId=13638825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13638825],
 there are some other issues with that). Adding reverse-container-ordering as a 
second-level ordering should guarantee the picking order is consistent (missing 
now). In particular, if the need for preemption persists over time, no new 
containers would be granted to this app, so picking from the "tail" should 
yield a consistent set of containers (minus the ones naturally expiring, which 
would be accounted for in future runs as a reduced preemption need). On the 
other hand, if the cluster conditions change drastically enough (e.g., a big 
job finishes) and there is no longer a need to kill some containers from this 
app, we save the cost of killing and rescheduling. In a sense, instead of 
acting on an instantaneous need for preemption every 15 seconds, we check every 
5 seconds and only kill when the need is sustained for a window longer than 
maxWaitTimeBeforeKill. I think that if we can get this to work as intended we 
would get a better overall policy (less jitter); a sketch of the ordering 
follows. 
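
As an illustration of that ordering, here is a minimal sketch assuming each 
container carries a priority and a monotonically increasing allocation id 
(AppContainer and the direction of the priority comparison are assumptions of 
this toy, not YARN's RMContainer):

{code}
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Toy container: a priority plus its allocation order. */
record AppContainer(int priority, long allocationId) {}

class VictimOrdering {
  // Assumption in this toy: larger priority value = less important work.
  // Secondary key: newest allocation first, i.e., pick from the "tail".
  static final Comparator<AppContainer> PREEMPT_FIRST =
      Comparator.comparingInt(AppContainer::priority).reversed()
          .thenComparing(
              Comparator.comparingLong(AppContainer::allocationId).reversed());

  /** Deterministic across runs as long as the input set is stable. */
  static List<AppContainer> victims(List<AppContainer> containers, int howMany) {
    return containers.stream()
        .sorted(PREEMPT_FIRST)
        .limit(howMany)
        .collect(Collectors.toList());
  }
}
{code}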

2) toPreempt is decremented in all three cases because we would otherwise 
double-kill for the same resource needs: imagine you want 5 containers back and 
send the corresponding preemption requests; while the AMs are working on 
preemption, the preemption procedure is called again and re-detects that we 
want 5 containers back. If you don't account for the pending requests (i.e., 
decrement toPreempt for those too), you would pick (preempt or kill) another 5 
containers (depending on the time constants this could happen more than 
twice)... now we are forcing the AMs to release 10 (or more) containers for a 
5-container preemption need. Anyway, I agree that once we converge on this we 
should document it clearly in the code; this seems the kind of code that people 
would try to "fix" :-). The shift you spotted with this comment is from running 
"rarely enough" that all the actions initiated during a previous run are fully 
reflected in the current cluster state, to running frequently enough that the 
actions we are taking might not be visible yet. This forces us to do some more 
bookkeeping and have robust heuristics, but I think it is worth the improvement 
in the scheduler behavior.
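
A toy sketch of that bookkeeping (all names are illustrative; memory-only for 
simplicity): the deficit computed this round is reduced by the resources of 
containers we already asked for but that have not yet been released or killed:

{code}
import java.util.ArrayList;
import java.util.List;

/** Toy model of the toPreempt accounting across invocations. */
class PreemptionLedger {
  /** A preemption request sent earlier; resolved = released or killed. */
  static class PendingRequest { int memoryMb; boolean resolved; }

  final List<PendingRequest> outstanding = new ArrayList<>();

  /** Deficit left after crediting containers already asked for. */
  int remainingToPreemptMb(int instantaneousDeficitMb) {
    int remaining = instantaneousDeficitMb;
    for (PendingRequest p : outstanding) {
      if (!p.resolved) {
        remaining -= p.memoryMb;  // avoid double-counting the same need
      }
    }
    return Math.max(remaining, 0);  // never ask for more than the current need
  }
}
{code}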

3) It is probably good to have a "no-preemption" mode in which we simply kill 
outright. However, by setting the time constants right (e.g., 
preemptionInterval at 5sec and maxWaitTimeBeforeKill at 10sec) you would get 
the same effect as a hard kill at most 15sec after the need for preemption 
arises, but every preemption-aware AM could save the progress made so far. In 
our current MR implementation of preemption, you might get containers back even 
faster, as we release containers as soon as we are done checkpointing. Note 
that since we are not actually killing at every preemptionInterval, we could 
set it very low (if the performance of the FS allows it) and get more points of 
observation and faster reaction times, while maxWaitTimeBeforeKill would be 
tuned as a tradeoff between giving the AM enough time to preempt and the speed 
of rebalancing. 
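
The worst-case latency under those example settings works out as follows (a 
back-of-the-envelope sketch; the variable names just mirror the time constants 
above):

{code}
/** Worst-case hard-kill latency for the example constants above. */
class KillLatency {
  public static void main(String[] args) {
    long preemptionIntervalMs = 5_000;   // how often the preemption check runs
    long maxWaitBeforeKillMs  = 10_000;  // grace given to the AM after the request
    // A need can arise just after a check, so it is noticed up to one interval
    // later; the hard kill then follows after the full grace period.
    long worstCaseMs = preemptionIntervalMs + maxWaitBeforeKillMs;
    System.out.println("hard kill at most " + worstCaseMs / 1000 + "s after the need arises");
  }
}
{code}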

I will look into adding the allocation-order as a second-level ordering for 
containers. Please let me know whether this seems enough or I am missing 
something.

                
> FairScheduler: support for work-preserving preemption 
> ------------------------------------------------------
>
>                 Key: YARN-568
>                 URL: https://issues.apache.org/jira/browse/YARN-568
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: scheduler
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-568.patch, YARN-568.patch
>
>
> In the attached patch, we modified the FairScheduler to substitute its 
> preemption-by-killing with a work-preserving version of preemption (followed 
> by killing if the AMs do not respond quickly enough). This should allow us to 
> run the preemption check more often, but kill less often (proper tuning to be 
> investigated). Depends on YARN-567 and YARN-45; related to YARN-569.

