You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" that
high; the purpose of this config is very different from the one you would
like to use it for.

I guess the confusion comes from the fact that you are still thinking in
terms of multiple Spark jobs.


*But dynamic allocation is useful in the case of a single Spark job, too.*
With dynamic allocation, if there are pending tasks then new resources
should be allocated to speed up the calculation.
If you do not have enough partitions then you do not have enough tasks to
run in parallel; that is what my earlier comment was about.
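
As a quick illustration (just a sketch; the input path and partition counts
below are made up), you can check how many partitions, and therefore how
many parallel tasks, your data gives you, and repartition if that number is
too low:

    // Hypothetical example: inspect and increase the parallelism of a DataFrame.
    // Each partition becomes one task, so with only 3 partitions at most 3 tasks
    // can run at the same time, no matter how many executors you have.
    val df = spark.read.parquet("/path/to/backend/data")   // placeholder path

    println(s"Partitions: ${df.rdd.getNumPartitions}")

    // If the count is small (e.g. 3), repartition so that more tasks can run in
    // parallel and dynamic allocation has pending tasks to scale up for.
    val repartitioned = df.repartition(64)   // 64 is just an illustrative value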

So let's focus on your first job:
- With 3 executors it takes 2 hours to complete, right?
- And what about 8 executors?  I hope significantly less time.

So if you have more than 3 partitions, the tasks run long enough to trigger
a request for extra resources (schedulerBacklogTimeout), and the number of
running executors is lower than the maximum number of executors you set
(maxExecutors), then why wouldn't you want to use those extra resources?
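
For reference, here is a minimal sketch of the settings discussed in this
thread (the min/max values are yours, the app name is a placeholder, and I
am assuming Spark 3.x on Kubernetes, where shuffle tracking is the usual
way to enable dynamic allocation without an external shuffle service):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-allocation-example")   // placeholder name
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "3")
      .config("spark.dynamicAllocation.maxExecutors", "8")
      // keep the default 1s backlog timeout: raising it only delays the scale-up
      .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
      .getOrCreate()

With these settings Spark starts with 3 executors and, as soon as tasks have
been pending for 1 second, requests more (up to 8); idle executors are given
back automatically.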



On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain <ranju.j...@ericsson.com> wrote:

> Hi Attila,
>
>
>
> Thanks for your reply.
>
>
>
> If I talk about a single job which started to run with minExecutors as *3*.
> And suppose this job [*which reads the full data from the backend,
> processes it and writes it to a location*]
>
> takes around 2 hours to complete.
>
>
>
> What I understood is that, as the default value of
> spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, executors will
> scale from *3* to *4* and then to *8* every second if tasks are pending
> at the scheduler backend. So if I don’t want it to be 1 sec, I might set
> it to 1 hour [3600 sec] for a 2 hour spark job.
>
>
>
> So this is all about when I want to scale executors dynamically for a
> spark job. Is that understanding correct?
>
>
>
> In the below statement I don’t understand much about available partitions
> :-(
>
> *pending tasks (which kinda related to the available partitions)*
>
>
>
>
>
> Regards
>
> Ranju
>
>
>
>
>
> *From:* Attila Zsolt Piros <piros.attila.zs...@gmail.com>
> *Sent:* Friday, April 9, 2021 12:13 AM
> *To:* Ranju Jain <ranju.j...@ericsson.com.invalid>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> Hi!
>
> For dynamic allocation you do not need to run the Spark jobs in parallel.
> Dynamic allocation simply means Spark scales up by requesting more
> executors when there are pending tasks (which kinda related to the
> available partitions) and scales down when the executor is idle (as within
> one job the number of partitions can fluctuate).
>
> But if you optimize for run time you can start those jobs in parallel at
> the beginning.
>
> In this case you will use a higher number of executors right from the
> beginning.
>
> The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
> schedule/synchronize different Spark jobs but it is about tasks.
>
> Best regards,
> Attila
>
>
>
> On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain <ranju.j...@ericsson.com.invalid>
> wrote:
>
> Hi All,
>
>
>
> I have set dynamic allocation enabled while running spark on Kubernetes.
> But new executors are requested if pending tasks are backlogged for more
> than the duration configured in the property
> *“spark.dynamicAllocation.schedulerBacklogTimeout”*.
>
>
>
> My Use Case is:
>
>
>
> There are a number of parallel jobs which might or might not run together
> at a particular point in time. E.g. only one Spark job may run at a given
> time, or two spark jobs may run at the same time, depending upon the need.
>
> I configured spark.dynamicAllocation.minExecutors as 3 and
> spark.dynamicAllocation.maxExecutors as 8 .
>
>
>
> Steps:
>
>    1. SparkContext is initialized with 3 executors and the first job is
>    requested.
>    2. Now, if a second job is requested after a few mins (e.g. 15 mins), I
>    am thinking I can use the benefit of dynamic allocation so the executors
>    scale up to handle the second job's tasks.
>
> For this I think *“spark.dynamicAllocation.schedulerBacklogTimeout”*
> needs to be set, after which new executors would be requested.
>
> *Problem: *The problem is that there is a chance the second job is not
> requested at all, or is requested only after 10 mins or 20 mins. How can I
> set a constant value for the
>
> property *“spark.dynamicAllocation.schedulerBacklogTimeout” *to scale the
> executors, when the task backlog depends upon the number of jobs
> requested?
>
>
>
> Regards
>
> Ranju
>
>
