[jira] [Comment Edited] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number

2017-12-12 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286253#comment-16286253
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/12/17 3:14 PM:
-

The impression I get from our discussion is that you mainly focus on job 
latency, that the current settings are optimized for it, and that this is why 
you consider the current setup sufficient.

If you look at your previous example from a resource usage point of view, my 
proposal would give about the same resource usage in both scenarios, whereas 
the current setup doubles the resource usage of the workload with small 
tasks... 

I've experimented with the parameters you proposed, but right now I don't have 
a way to optimize my type of workload (as opposed to tuning every single job) 
for resource consumption. 

I don't know if the majority of Spark users run on idle clusters, but ours is 
routinely full, so for us resource usage is more important than latency.


was (Author: jcuquemelle):
The impression I get from our discussion is that you mainly focus on the 
latency of the jobs, and that the current setting is optimized for that, which 
is why you consider the current setup sufficient.

If you consider your previous example from a resource usage point of view, my 
proposal would allow to have about the same resource usage in both scenarios, 
but the current setup doubles the resource usage of the workload with small 
tasks... 

I've tried to experiment with the parameters you've proposed, but right now I 
don't have a solution to optimize my type of workload (nor every single job) 
for resource consumption. 

I don't know if the majority of Spark users run on idle clusters, but ours is 
routinely full, so for us resource usage is more important than latency.

> Allow tuning the number of dynamically allocated executors wrt task number
> --
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors to have each
> task slot execute a single task, which minimizes latency but wastes resources
> when tasks are small relative to the executor allocation overhead.
> By adding a tasksPerExecutorSlot parameter, it becomes possible to specify how
> many tasks a single slot should ideally execute, to mitigate the overhead of
> executor allocation.
> PR: https://github.com/apache/spark/pull/19881
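
As a rough illustration of the description above (a sketch only, not the actual ExecutorAllocationManager code), the proposed parameter simply divides the executor target derived from the number of pending tasks; all names and numbers below are illustrative:

{code:scala}
// Sketch of the executor target described in the issue, assuming the proposed
// tasksPerExecutorSlot knob; this is NOT the actual Spark implementation.
def targetExecutors(pendingAndRunningTasks: Int,
                    executorCores: Int,         // spark.executor.cores
                    taskCpus: Int,              // spark.task.cpus
                    tasksPerExecutorSlot: Int): Int = {
  val slotsPerExecutor = executorCores / taskCpus
  // tasksPerExecutorSlot = 1 reproduces the current policy: one slot per pending task.
  // tasksPerExecutorSlot = N lets each slot run N tasks in sequence, requesting fewer executors.
  math.ceil(pendingAndRunningTasks.toDouble / (slotsPerExecutor * tasksPerExecutorSlot)).toInt
}

// Example: 6000 pending tasks, 4 cores per executor, 1 CPU per task:
//   tasksPerExecutorSlot = 1  -> 1500 executors requested (current behaviour)
//   tasksPerExecutorSlot = 6  ->  250 executors requested
{code}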



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number

2017-12-11 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286032#comment-16286032
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/11/17 3:21 PM:
-

I'd like to point out that what I refer to as "my use case" is actually a 
wide range of job sizes, ranging from tens to thousands of splits in the data 
partitioning and from 400 to 9,000 seconds of wall-clock time.

Maybe I didn't emphasize enough in my last answer that the figures I presented 
using schedulerBacklogTimeout required tuning for one specific job, and that the 
resulting value of the parameter will not be optimal for other job settings. My 
proposal provides a sensible default over the whole range of jobs.

This is also true of the maxExecutors parameter: each job would need to be 
tuned individually with it.

Regarding "there's nothing inherently more efficient about the scheduling 
policy you're proposing", I might add a few information: so far I have aimed at 
having the same latency as the legacy MR jobs, which yield the 6 tasksPerSlot 
figure and a 30% resource usage decrease. 

Here are the figures with other values of taskPerSlot, in terms of gain wrt the 
corresponding MR jobs: 
||taskPerSlot||VcoreGain wrt MR (%)||WallTimeGain wrt MR (%)||
|1.0|-114.329984|43.777071|
|2.0|5.373456|37.610840|
|3.0|20.785099|28.528394|
|4.0|26.501597|18.516913|
|6.0|30.852756|-0.991918|
|8.0|31.147028|-14.633520|

It seems to me that 2 tasks per slot is particularly interesting, as it moves 
us from doubling the resources used to a usage similar to MR, with a very low 
impact on latency.

I bet that this resource usage reduction might not be limited to our use case 
...
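
For anyone who wants to try this once the PR lands, here is a minimal sketch of how the setting could be applied; the exact configuration key is defined in the PR (https://github.com/apache/spark/pull/19881), so the name used below is an assumption taken from the JIRA description:

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch; the "tasksPerExecutorSlot" key name is assumed from the JIRA
// description and may differ from the one actually introduced by the PR.
val spark = SparkSession.builder()
  .appName("taskPerSlot-tradeoff")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  // 2 keeps latency close to the current policy while avoiding most of the
  // over-allocation; 6 matches the legacy MR latency per the table above.
  .config("spark.dynamicAllocation.tasksPerExecutorSlot", "2")
  .getOrCreate()
{code}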


was (Author: jcuquemelle):
I'd like to point out that when refering to "my use case", it is actually a 
wide range of job sizes, ranging from tens to thousands splits in the data 
partitioning and between 400 to 9000 seconds of wall clock time.

Maybe I didn't emphasize enough in my last answer that the figures I presented 
using schedulerBacklogTimeout needed tuning for a specific job, and that the 
resulting value of the parameter will not be optimal for other job settings. My 
proposal allows to have a sensible default over the whole range of jobs.

This is also true of the maxExecutors parameters, each job would need to be 
tuned using this parameter.

Regarding "there's nothing inherently more efficient about the scheduling 
policy you're proposing", I might add a few information: so far I have aimed at 
having the same latency as the legacy MR jobs, which yield the 6 tasksPerSlot 
figure and a 30% resource usage decrease. 

Here are the figures with other values of taskPerSlot, in terms of gain wrt the 
corresponding MR jobs: 
||taskPerCore||VcoreGain wrt MR (%)||WallTimeGain wrt MR (%)||
|1.0|-114.329984|43.777071|
|2.0|5.373456|37.610840|
|3.0|20.785099|28.528394|
|4.0|26.501597|18.516913|
|6.0|30.852756|-0.991918|
|8.0|31.147028|-14.633520|

It seems to me the 2 tasks per slot is particularly interesting, as it allows 
to go from a doubling of used resources to a usage similar to MR with a very 
low impact on the latency.

I bet that this resource usage reduction might not be limited to our use case 
...

> Allow tuning the number of dynamically allocated executors wrt task number
> --
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors to have each
> task slot execute a single task, which minimizes latency but wastes resources
> when tasks are small relative to the executor allocation overhead.
> By adding a tasksPerExecutorSlot parameter, it becomes possible to specify how
> many tasks a single slot should ideally execute, to mitigate the overhead of
> executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Comment Edited] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number

2017-12-06 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280548#comment-16280548
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/6/17 5:32 PM:


I don't understand your statement about delaying executor addition; I want to 
cap the number of executors adaptively with respect to the current number of 
tasks, not delay their creation.

Doing this with dynamicAllocation.maxExecutors requires each job to be tuned 
individually for efficiency; when we're doing experiments, a lot of jobs are 
one-shot, so they can't be fine-tuned.
The proposal gives a way to get adaptive behaviour for a whole family of jobs.

Regarding slowing the ramp-up of executors with schedulerBacklogTimeout, I've 
run two series of experiments with this parameter (7 jobs per test case; 
average figures are reported in the following table), one on a busy queue and 
one on an idle queue. I'll report only the idle queue, as the figures on the 
busy queue are even worse for the schedulerBacklogTimeout approach: 

The first row uses the default 1s schedulerBacklogTimeout together with the 
6 tasks per executor slot mentioned above; the other rows use the default 
dynamicAllocation behaviour and only change schedulerBacklogTimeout:

||SparkWallTimeSec||Spk-vCores-H||taskPerExeSlot||schedulerBacklogTimeout||
|693.571429|37.142857|6|1.0|
|584.857143|69.571429|1|30.0|
|763.428571|54.285714|1|60.0|
|826.714286|39.571429|1|90.0|

So basically I can tune the backlog timeout to get a similar vCore-H 
consumption at the expense of almost 20% more wall-clock time, or I can tune it 
to get about the same wall-clock time at the expense of roughly 60% more 
vCore-H consumption (very roughly extrapolated between the 30s and 60s values 
of schedulerBacklogTimeout).
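
As a sanity check, those percentages can be reproduced directly from the idle-queue table above (a small worked example; the interpolation behind the 60% figure is approximate):

{code:scala}
// Worked check of the tradeoff quoted above, using the idle-queue table.
val (baseWallSec, baseVcoreH) = (693.571429, 37.142857)  // 6 tasks/slot, 1s backlog timeout
val (wall90s,     vcoreH90s)  = (826.714286, 39.571429)  // default policy, 90s backlog timeout

// Similar vCore-H (37.1 vs 39.6) but ~19% more wall-clock time:
val extraWall = (wall90s - baseWallSec) / baseWallSec            // ≈ 0.19
// Interpolating between the 30s row (584.9s, 69.6 vCore-H) and the 60s row
// (763.4s, 54.3 vCore-H) to match the baseline wall-clock time of ~694s gives
// roughly 60 vCore-H, i.e. ≈ 60% more than the 37.1 vCore-H baseline.
println(f"90s timeout: ${extraWall * 100}%.1f%% more wall-clock time for similar vCore-H")
{code}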

It does not seem to solve the issue I'm trying to address; moreover, this would 
again need to be tuned for each specific job's duration (to find the 90s 
timeout that gives similar resource consumption, I had to solve the exponential 
ramp-up against the duration of a job that had already run, which is not 
feasible in experimental use cases).
The previous experiments, which allowed me to find the sweet spot at 6 tasks 
per slot, involved job wall-clock times between 400 and 9,000 seconds.

Another way to look at the new parameter I'm proposing is as a simple way to 
tune the latency / resource consumption tradeoff. 


was (Author: jcuquemelle):
I don't understand your statement about delaying executor addition ? I want to 
cap the number of executors in an adaptive way regarding the current number of 
tasks, not delay their creation.

Doing this with dynamicAllocation.maxExecutors requires each job to be tuned 
for efficiency; when we're doing experiments, a lot of jobs are one shot, so 
they can't be fine tuned.
The proposal gives a way to have an adaptive behaviour for a family of jobs.

Regarding slowing the ramp up of executors with schedulerBacklogTimeout, I've 
made experiments to play with this parameter; I have made 2 series of 
experiments (7 jobs on each test case, average figures reported in the 
following table), one on a busy queue, and the other on an idle queue. I'll 
report only the idle queue, as the figures 
on the busy queue are even worse for the schedulerBacklogTimeout approach: 

First row is using the default 1s for the schedulerBacklogTimeout, and uses the 
6 tasks per executorSlot I've mentioned above, other rows use the default 
dynamicAllocation behaviour and only change schedulerBacklogTimeout

||SparkWallTimeSec||Spk-vCores-H||taskPerExeSlot||schedulerBacklogTimeout||
|693.571429|37.142857|6|1.0|
|584.857143|69.571429|1|30.0|
|763.428571|54.285714|1|60.0|
|826.714286|39.571429|1|90.0|

So basically I can tune the backlogTimeout to get a similar vCores-H 
consumption at the expense of almost 20% more wallClockTime, or I can tune the 
parameter to get about the same wallClockTime at the expense of about 60% more 
vcoreH consumption (very roughly extrapolated between 30 and 60 secs for 
schedulerBacklogTimeout).

It does not seem to solve the issue I'm trying to address, moreover this would 
again need to be tuned for each specific job's duration (to find the 90s 
timeout to get the similar resource consumption, I had to solve the exponential 
ramp-up with the duration of the already run job, which is not feasible in 
experimental use cases ).
The previous experiments that allowed me to find the sweet spot at 6 tasks per 
slot has involved job wallClockTimes between 400 and 9000 seconds

Another way to have a look at this new parameter I'm proposing is to have a 
simple way to tune the latency / resource consumption tradeoff.

[jira] [Comment Edited] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number

2017-12-05 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278405#comment-16278405
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/5/17 1:09 PM:


Thanks [~srowen] for your quick feedback, let me give you more context / 
explanation about what I'm trying to achieve here. 

- I'm not trying to mitigate the overhead of launching small tasks; I'm trying 
to avoid allocating executors that will not (or will barely) be used.
As Spark tuning guidelines usually aim towards having small tasks, it seems 
sub-optimal to allocate a large number of executors at application start (or at 
any stage that creates many more tasks than available executor slots), only for 
them to be discarded without having been used.

- I've scoped this parameter to dynamic allocation, as it only impacts the way 
the executor count ramps up when tasks are backlogged. The default value of the 
parameter falls back to the current behavior.

- I do agree that over-allocated containers will eventually be deallocated, but 
this resource occupation by idle executors appears to be non-trivial on some of 
our workloads. I've run several experiments on a set of similar jobs working on 
different data sizes, on both idle and busy clusters; the resource usage 
(vcore-hours allocated, as seen by YARN) of the Spark jobs is compared to the 
same jobs running in MapReduce, with the executor idle timeout set to 30s.
When running with the current setting (1 task per executor slot), our Spark 
jobs consume on average twice the vcore-hours of the MR jobs; when running with 
6 tasks per executor slot, they consume on average 30% less vcore-hours than 
the MR jobs, and this setting holds across different workload sizes.
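
For reference, here is a minimal sketch of the dynamic-allocation settings relevant to this comparison; only the 30s idle timeout is stated above, the other keys are standard Spark settings shown for context:

{code:scala}
import org.apache.spark.SparkConf

// Sketch of the settings relevant to the comparison above (Spark 2.x on YARN).
// Only the 30s idle timeout comes from the comment; the rest is standard setup.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")                // required by dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "30s")   // idle executors released after 30s
  .set("spark.dynamicAllocation.minExecutors", "0")
// Resource usage is then compared as vcore-hours allocated, as reported by YARN,
// against the equivalent MapReduce jobs.
{code}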



was (Author: jcuquemelle):
Thanks [~srowen] for your quick feedback, let me give you more context / 
explanation about what I'm trying to achieve here. 

- I'm not trying to mitigate the overhead of launching small tasks, I'm trying 
to avoid allocating executors that will not / barely be used.
As spark tuning guideline usually aim towards having small tasks, it seems 
sub-optimal to allocate a large number of executors at application start
(or at any stage that would create much more tasks than available executor 
slots), that end up being discarded without having been used.

- I've scoped this parameter to dynamic allocation, as it only impacts the way 
the executor count ramps up when tasks are backlogged. The default value of the 
parameter falls back to the current behavior.

- I do agree that over-allocated containers will eventually be deallocated, but 
this resource occupation by idle executors appears to be
non trivial on some of our workloads: 
I've made several experiments on a set of similar jobs working on different 
data sizes, and running on an idle or busy cluster;
The resource usage (vcore-hours allocated as seen by Yarn) of the spark 
jobs is compared to the same job running in MapReduce;
The executor idle timeout is set to 30s;
When running with the current setting (1 task per executor slot), our Spark 
jobs consume in average twice the amount of vcorehours than the MR job
When running with 6 tasks per executor slot, our Spark jobs consume in 
average 30% less vcorehours than the MR jobs, this setting being valid for 
different workload sizes.


> Allow tuning the number of dynamically allocated executors wrt task number
> --
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors to have each
> task slot execute a single task, which minimizes latency but wastes resources
> when tasks are small relative to the executor allocation overhead.
> By adding a tasksPerExecutorSlot parameter, it becomes possible to specify how
> many tasks a single slot should ideally execute, to mitigate the overhead of
> executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


