Adam Kennedy created SPARK-25889:
------------------------------------

             Summary: Dynamic allocation load-aware ramp up
                 Key: SPARK-25889
                 URL: https://issues.apache.org/jira/browse/SPARK-25889
             Project: Spark
          Issue Type: New Feature
          Components: Scheduler, YARN
    Affects Versions: 2.3.2
            Reporter: Adam Kennedy


The time based exponential ramp up behavior for dynamic allocation is naive and 
destructive, making it very difficult to run very large jobs.

On a large (64,000 core) YARN cluster with a high number of input partitions 
(200,000+) the default dynamic allocation approach of requesting containers in 
waves, doubling exponentially once per second, results in 50% of the entire 
cluster being requested in the final 1 second wave.

This can easily overwhelm RPC processing, or cause expensive Executor startup 
steps to break systems. With the interval so short, many additional containers 
may be requested beyond what is actually needed and then complete very little 
work before sitting around waiting to be deallocated.

Delaying the time between these fixed doublings only has limited impact. 
Setting double intervals to once per minute would result in a very slow ramp up 
speed, at the end of which we still face large potentially crippling waves of 
executor startup.

An alternative approach to spooling up large job appears to be needed, which is 
still relatively simple but could be more adaptable to different cluster sizes 
and differing cluster and job performance.

I would like to propose a few different approaches based around the general 
idea of controlling outstanding requests for new containers based on the number 
of executors that are currently running, for some definition of "running".

One example might be to limit requests to one new executor for every existing 
executor that currently has an active task. Or some ratio of that, to allow for 
more or less aggressive spool up. A lower number would let us approximate 
something like fibonacci ramp up, a higher number of say 2x would spool up 
quickly, but still aligned with the rate at which broadcast blocks can be 
easily distributed to new members.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to