[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630898#comment-13630898 ]

Carlo Curino commented on YARN-45:
----------------------------------

As you pointed out, any decision made in the RM must cope with an inconsistent 
and evolving view of the world, and preemption actions suffer from an inherent 
and significant lag. Policies designed around this must embrace such chaos: 
operate conservatively and try to affect only macroscopic properties (hence the 
many built-in dampers Chris mentioned). 
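
To make the "dampers" point concrete, here is a minimal sketch of how a policy 
can damp its per-round actions; the class name and the 5% factor are 
illustrative assumptions, not the actual RM code:

// Illustrative sketch: damp preemption so each round reclaims only a
// fraction of the measured imbalance, reacting to macroscopic trends
// rather than to transient (possibly stale) snapshots of cluster state.
public class DampedPreemptionPolicy {

  // Fraction of the observed over-allocation reclaimed per round;
  // the 0.05 value is an assumption for illustration only.
  private static final double DAMPING_FACTOR = 0.05;

  // currentMB:    memory currently used by an over-capacity queue
  // guaranteedMB: memory the queue is entitled to
  // returns:      memory (MB) to ask back in this round, damped
  public long preemptionTargetMB(long currentMB, long guaranteedMB) {
    long overage = Math.max(0, currentMB - guaranteedMB);
    return (long) (overage * DAMPING_FACTOR);
  }
}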

As for what to do with the preemption requests, our current implementation for 
the MapReduce AM/Task is quite aligned with your comments. 

Here's what we do:
1) Maps are typically short-lived, so it is often worth ignoring the preemption 
request and trying to "make a run for it": checkpointing and completion times 
are likely to be comparable, and re-execution costs are low. 

2) For reducers, since their state is valuable and their runtimes are often 
longer, the AM asks the task to checkpoint (see the sketch after this list). In 
our current implementation, once the state of the reducer has been saved to a 
checkpoint we exit, as continuing execution is non-trivial (in particular, 
managing the partial output of reducers). I can envision a future version that 
tries to continue running after having taken a checkpoint. 
Note that this (the task exiting) does not introduce any new race condition or 
complexity in either the RM or the AM, as both already handle failed/killed 
tasks, and the AM even has logic to kill its own reducers to free up space for 
maps.  
More importantly, this setup (in which containers exit as soon as they are done 
checkpointing) allows us to set rather generous "wait-before-kill" parameters, 
since the containers will be reclaimed as soon as the task is done 
checkpointing anyway. 
The alternative would have the RM pick a static waiting policy, which risks 
being either too long (delaying the rebalancing too much) or too short 
(interrupting containers while they finish checkpointing, thus wasting work). I 
expect that no static solution would fare well across a broad range of AMs and 
job sizes. 

3) When the preemption takes the form of a ResourceRequest, we pick reducers 
over maps (keeping reducers running while their maps are killed would simply 
waste slot time). Looking forward in YARN's future, this is a key feature: 
other applications might have evolving priorities for their containers which 
are not exposed to the RM, so we cannot rely on the RM to guess which container 
is best to preempt, and delegating the choice to the AM could be invaluable.
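
To tie 1-3 together, a rough sketch of the AM-side handling, written against 
the PreemptionMessage/PreemptionContract shape proposed in this JIRA; the 
helpers runningReducers/requestCheckpoint are hypothetical AM internals, not 
the actual MapReduce AM code:

import java.util.Collections;
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

// Sketch of AM-side preemption handling; helper methods are hypothetical.
public class PreemptionHandler {

  void onAllocate(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null) {
      return; // the RM asked nothing back this heartbeat
    }

    // Strict contract: the RM will reclaim these specific containers no
    // matter what, so checkpoint whatever state we can right away.
    if (msg.getStrictContract() != null) {
      for (PreemptionContainer c : msg.getStrictContract().getContainers()) {
        requestCheckpoint(c.getId());
      }
    }

    // Negotiable contract carrying a ResourceRequest: the RM says how much
    // it wants back and the AM decides which containers to give up. Pick
    // reducers over maps: their state is checkpointable, and reducers
    // idling while their maps are killed would just waste slot time.
    // (A real AM would stop once the request is satisfied.)
    if (msg.getContract() != null
        && msg.getContract().getResourceRequest() != null) {
      for (ContainerId id : runningReducers()) {
        requestCheckpoint(id); // the task saves its state, then exits
      }
      // Maps are left to "make a run for it": re-execution is cheap.
    }
  }

  // Hypothetical AM internals, elided here.
  private List<ContainerId> runningReducers() {
    return Collections.emptyList();
  }

  private void requestCheckpoint(ContainerId id) {
    // signal the task to take a checkpoint (MAPREDUCE-4584 machinery)
  }
}

The key point the sketch illustrates is that, for the negotiable case, the 
choice of which containers to sacrifice stays with the AM, not the RM.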

                
> Scheduler feedback to AM to release containers
> ----------------------------------------------
>
>                 Key: YARN-45
>                 URL: https://issues.apache.org/jira/browse/YARN-45
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Chris Douglas
>            Assignee: Carlo Curino
>         Attachments: YARN-45.patch, YARN-45.patch
>
>
> The ResourceManager strikes a balance between cluster utilization and strict 
> enforcement of resource invariants in the cluster. Individual allocations of 
> containers must be reclaimed, or reserved, to restore the global invariants 
> when cluster load shifts. In some cases, the ApplicationMaster can respond to 
> fluctuations in resource availability without losing the work already 
> completed by that task (MAPREDUCE-4584). Supplying it with this information 
> would be helpful for overall cluster utilization [1]. To this end, we want to 
> establish a protocol for the RM to ask the AM to release containers.
> [1] http://research.yahoo.com/files/yl-2012-003.pdf
