Re: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-14 Thread Shay Elbaz
We're actually running on on-prem Kubernetes with a custom-built Spark 
image, with altered entrypoint.sh and other "low-level" scripts and configs, 
but I don't think this is a good direction for solving this specific issue.

Shay

From: Artemis User 
Sent: Thursday, November 3, 2022 8:35 PM
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of 
executors when using GPUs


ATTENTION: This email originated from outside of GM.

  Now I see what you want to do.  If you have access to the cluster 
configuration files, you can modify the spark-env.sh file on the worker nodes 
to specify exactly which nodes are linked with GPU cores and which are 
not.  That way, only nodes configured with GPU resources get 
scheduled/acquired for your GPU tasks (see the Rapids user guide at 
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html).

We are using Rapids in our on-prem Spark environment with complete control of 
OS, file and network systems, containers and even hardware/GPU settings.  I 
guess you are using one of the cloud services, so I am not sure whether you have 
access to the low-level cluster config on EMR or GCP, which give you 
cookie-cutter cluster settings with limited configurability.  But under 
the hood, I believe they do use Nvidia Rapids, which currently is the only 
option for GPU acceleration in Spark (the Spark 3.x.x distribution package doesn't 
include Rapids or any GPU integration libs).  So you may want to dive into the 
Rapids instructions for more configuration and usage info (they provide 
detailed instructions on how to run Rapids on EMR, Databricks and GCP).

On 11/3/22 12:10 PM, Shay Elbaz wrote:
Thanks again Artemis, I really appreciate it. I have watched the video but did 
not find an answer.

Please bear with me just one more iteration 

Maybe I'll be more specific:
Suppose I start the application with maxExecutors=500, executors.cores=2, 
because that's the amount of resources needed for the ETL part. But for the DL 
part I only need 20 GPUs. The SLS API only allows setting the resources per 
executor/task, so Spark would (try to) allocate up to 500 GPUs, assuming I 
configure the profile with 1 GPU per executor.
So, the question is how do I limit the stage resources to 20 GPUs total?
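
A rough model of why the profile alone does not cap the GPU count. It assumes dynamic allocation sizes its request from the number of pending tasks divided by the task slots per executor, bounded by maxExecutors. The function name, its parameters, and the 1000-task figure are illustrative, not Spark API or numbers from this thread.

```python
import math


def executors_requested(pending_tasks: int, tasks_per_executor: int,
                        max_executors: int) -> int:
    """Approximate executor target for one stage under dynamic allocation.

    Assumption: target = ceil(pending_tasks / tasks_per_executor),
    capped by spark.dynamicAllocation.maxExecutors.
    """
    if pending_tasks < 0 or tasks_per_executor < 1 or max_executors < 0:
        raise ValueError("invalid task or executor counts")
    needed = math.ceil(pending_tasks / tasks_per_executor)
    return min(needed, max_executors)


# ETL stage: 2 cores per executor, 1 CPU per task -> 2 task slots per executor.
etl = executors_requested(pending_tasks=1000, tasks_per_executor=2,
                          max_executors=500)

# DL stage: 1 GPU per executor and 1 GPU per task -> 1 slot per executor.
# The stage has as many tasks as the previous stage left partitions.
dl_many_tasks = executors_requested(pending_tasks=1000, tasks_per_executor=1,
                                    max_executors=500)

# Same DL stage, but with only 20 tasks pending.
dl_20_tasks = executors_requested(pending_tasks=20, tasks_per_executor=1,
                                  max_executors=500)

print(etl, dl_many_tasks, dl_20_tasks)
```

Under these assumptions the only per-stage lever is the number of tasks, since maxExecutors is shared by every stage in the application.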

Thanks again,
Shay


From: Artemis User <arte...@dtechspace.com>
Sent: Thursday, November 3, 2022 5:23 PM
To: user@spark.apache.org
Subject: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of 
executors when using GPUs


  Shay,  You may find this video helpful (with some API code samples that you 
are looking for).  https://www.youtube.com/watch?v=JNQu-226wUc&t=171s.  The 
issue here isn't how to limit the number of executors but how to request the 
right GPU-enabled executors dynamically.  Executors used in pre-GPU 
stages should be returned to the resource manager when dynamic resource 
allocation is enabled (with the right DRA policies).  Hope this helps.

Unfortunately there aren't many detailed docs on this topic, since GPU 
acceleration is kind of new in Spark (not straightforward like in TF).   I wish 
the Spark doc team could provide more details in the next release...

On 11/3/22 2:37 AM, Shay Elbaz wrote:
Thanks Artemis. We are not using Rapids, but rather using GPUs through the 
Stage Level Scheduling feature with ResourceProfile. In Kubernetes you have to 
turn on shuffle tracking for dynamic allocation, anyhow.
The question is how we can limit the number of executors when building a new 
ResourceProfile, directly (API) or indirectly (some advanced workaround).

Thanks,
Shay



From: Artemis User <arte...@dtechspace.com>
Sent: Thursday, November 3, 2022 1:16 AM
To: user@spark.apache.org
Subject: [EXTERNAL] Re: Stage level scheduling - lower the number of executors 
when using GPUs


  Are you using Rapids for GPU support in Spark?  Couple of options you may 
want to try:

  1.  In addition to turning on dynamic allocation, you may also need to turn on 
the external shuffle service.
  2.  Sounds like you are using Kubernetes.  In that case, you may also need to 
turn on shuffle tracking.
  3.  The "stages" are controlled by the APIs.  The APIs for dynamic resource 
request (change of stage) do exist, but only for RDDs (e.g. TaskResourceRequest 
and ExecutorResourceRequest).
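
The dynamic allocation and shuffle tracking settings from items 1 and 2 correspond to standard Spark properties. A minimal sketch that renders them as spark-submit flags follows; the helper name is hypothetical, and the maxExecutors and cores values are simply the ones mentioned elsewhere in this thread. The external shuffle service property is left out on the assumption of a Kubernetes deployment that relies on shuffle tracking instead.

```python
def to_submit_flags(conf: dict) -> list:
    """Render a property map as a flat list of spark-submit arguments."""
    flags = []
    # Sort keys so the output is stable across runs.
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


dynamic_allocation_conf = {
    # Let Spark add and remove executors as the task backlog changes.
    "spark.dynamicAllocation.enabled": "true",
    # Track shuffle files on executors instead of using an external service.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Application-wide cap; it applies to every stage, CPU or GPU.
    "spark.dynamicAllocation.maxExecutors": "500",
    "spark.executor.cores": "2",
}

flags = to_submit_flags(dynamic_allocation_conf)
print(" ".join(flags))
```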

On 11/2/22 11:30 AM, Shay Elbaz wrote:
Hi,

Our typical applications need fewer executors for a GPU stage than for a CPU 
stage. We are using dynamic allocation with stage level scheduling, and Spark 
tries

Re: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-04 Thread Tom Graves
So I'm not sure I completely follow. Are you asking for a way to change the 
limit without having to do the repartition?  And your DL software doesn't care 
if you got, say, 30 executors instead of 20?  Normally I would expect the number 
of partitions at that point to be 200 (or whatever you set for your shuffle 
partitions) unless you are using the AQE coalescing partitions functionality, in 
which case it could change. Are you using the latter?
> Normally I try to aim for anything between 30s-5m per task (failure-wise), 
> depending on the cluster, its stability, etc. But in this case, individual 
> tasks can take 30-60 minutes, if not much more. Any failure during this long 
> time is pretty expensive.
Are you saying that when you manually do the repartition your DL tasks take 30-60 
minutes?  So again, you want something like AQE coalesce partitions to kick in and 
attempt to pick partition sizes for you?


Tom

On Thursday, November 3, 2022 at 03:18:07 PM CDT, Shay Elbaz 
 wrote:  
 
This is exactly what we ended up doing! The only drawback I saw with this 
approach is that the GPU tasks get pretty big (in terms of data and compute 
time), and task failures become expensive. That's why I reached out to the 
mailing list in the first place.
Normally I try to aim for anything between 30s-5m per task (failure-wise), 
depending on the cluster, its stability, etc. But in this case, individual 
tasks can take 30-60 minutes, if not much more. Any failure during this long 
time is pretty expensive.
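
To make the trade-off concrete, here is a small back-of-the-envelope sketch. It assumes the DL work divides evenly across partitions, one task per GPU executor, and that a failed task is rerun from scratch. The 900 GPU-minutes total is a made-up figure chosen so that 20 partitions gives 45-minute tasks; the function name is hypothetical.

```python
def stage_tradeoff(total_gpu_minutes: float, partitions: int,
                   max_executors: int = 500) -> dict:
    """Per-task duration and executor demand for a given partition count."""
    if partitions < 1:
        raise ValueError("partitions must be positive")
    return {
        "partitions": partitions,
        # A failed task loses at most this much work before it is retried.
        "minutes_per_task": total_gpu_minutes / partitions,
        # One task per executor, so demand follows the task count up to the cap.
        "executors_requested": min(partitions, max_executors),
    }


for n in (20, 200, 1000):
    row = stage_tradeoff(900.0, n)
    print(f"{row['partitions']:>5} partitions: "
          f"{row['minutes_per_task']:.1f} min/task, "
          f"{row['executors_requested']} executors requested")
```

Under these assumptions, 20 partitions keeps the stage at 20 GPUs but puts up to 45 minutes of work at risk per failure, while 200 partitions brings tasks down to 4.5 minutes at the price of the stage asking for up to 200 GPU executors.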

Shay

From: Tom Graves
Sent: Thursday, November 3, 2022 7:56 PM
To: Artemis User; user@spark.apache.org; Shay Elbaz
Subject: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of 
executors when using GPUs 


Stage level scheduling does not allow you to change configs right now. This is 
something we thought about as a follow-on but have never implemented.  How many 
tasks on the DL stage are you running?  The typical case is to run some ETL with 
lots of tasks, then do a mapPartitions and run your DL stuff. Before that 
mapPartitions you could do a repartition if necessary to get to exactly the 
number of tasks you want (20).  That way, even if maxExecutors=500, you will only 
ever need 20 (or whatever you repartition to), and Spark isn't going to ask for 
more than that.
Tom
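
A minimal PySpark sketch of the repartition approach described above, assuming one GPU per executor and per task. `run_dl_stage`, `infer_partition`, and `partitions_for_gpu_budget` are hypothetical names, not part of Spark. PySpark is imported inside the function so the helper can be exercised without a cluster, and on Kubernetes the GPU request may also need a discovery script and vendor, which are omitted here.

```python
def partitions_for_gpu_budget(total_gpus: int, gpus_per_task: int = 1) -> int:
    """Number of partitions (= tasks) that fits within the GPU budget."""
    if total_gpus < 1 or gpus_per_task < 1:
        raise ValueError("GPU counts must be positive")
    if gpus_per_task > total_gpus:
        raise ValueError("a single task needs more GPUs than the budget")
    return total_gpus // gpus_per_task


def run_dl_stage(etl_rdd, infer_partition, total_gpus: int = 20):
    """Run infer_partition over etl_rdd using at most total_gpus tasks.

    etl_rdd is the RDD produced by the ETL stages; infer_partition is a
    user-supplied function taking and returning an iterator of records.
    """
    # Imported lazily so this module loads without PySpark installed.
    from pyspark.resource import (
        ExecutorResourceRequests,
        ResourceProfileBuilder,
        TaskResourceRequests,
    )

    # Assumed shape: 2 cores and 1 GPU per executor, 1 GPU per task.
    executor_reqs = ExecutorResourceRequests().cores(2).resource("gpu", 1)
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    # In PySpark, `build` is a property, not a method call.
    profile = (
        ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build
    )

    # Shrinking to 20 partitions means the DL stage has only 20 tasks,
    # so dynamic allocation has no reason to ask for more executors.
    num_partitions = partitions_for_gpu_budget(total_gpus)
    return (
        etl_rdd.repartition(num_partitions)
        .mapPartitions(infer_partition)
        .withResources(profile)
    )
```

The repartition adds a shuffle, and each of the 20 tasks then carries a twentieth of the data, which is where the long task durations discussed in this thread come from.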

On Thursday, November 3, 2022 at 11:10:31 AM CDT, Shay Elbaz 
 wrote:

Thanks again Artemis, I really appreciate it. 
I have watched the video but did not find an answer.
Please bear with me just one more iteration.

Maybe I'll be more specific:
Suppose I start the application with maxExecutors=500, executors.cores=2, 
because that's the amount of resources needed for the ETL part. But for the DL 
part I only need 20 GPUs. The SLS API only allows setting the resources per 
executor/task, so Spark would (try to) allocate up to 500 GPUs, assuming I 
configure the profile with 1 GPU per executor.
So, the question is how do I limit the stage resources to 20 GPUs total?

Thanks again,
Shay

From: Artemis User 
Sent: Thursday, November 3, 2022 5:23 PM
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of 
executors when using GPUs 


  Shay,  You may find this video helpful (with some API code samples that you 
are looking for).  https://www.youtube.com/watch?v=JNQu-226wUc&t=171s.  The 
issue here isn't how to limit the number of executors but how to request the 
right GPU-enabled executors dynamically.  Executors used in pre-GPU 
stages should be returned to the resource manager when dynamic resource 
allocation is enabled (with the right DRA policies).  Hope this helps.

Unfortunately there aren't many detailed docs on this topic, since GPU 
acceleration is kind of new in Spark (not straightforward like in TF).   I wish 
the Spark doc team could provide more details in the next release...

On 11/3/22 2:37 AM, Shay Elbaz wrote:

Thanks Artemis. We are not using Rapids, but rather using GPUs through the 
Stage Level Scheduling feature with ResourceProfile. In Kubernetes you have to 
turn on shuffle tracking for dynamic allocation, anyhow.
The question is how we can limit the number of executors when building a new 
ResourceProfile, directly (API) or indirectly (some advanced workaround).

Thanks,
Shay

From: Artemis User
Sent: Thursday, November 3, 2022 1:16 AM
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: Stage level scheduling - lower the number of executors 
when using GPUs 


  Are you using Rapids for GPU support in Spark?  Couple of options you may 
want to try:
   
   - In addition to turning on dynamic allocation, you may also need to turn on 
the external shuffle service.

   - Sounds like you are using Kubernetes.  In that case, you may also need to 
turn on shuf

Re: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Shay Elbaz
This is exactly what we ended up doing! The only drawback I saw with this 
approach is that the GPU tasks get pretty big (in terms of data and compute 
time), and task failures become expensive. That's why I reached out to the 
mailing list in the first place.
Normally I try to aim for anything between 30s-5m per task (failure-wise), 
depending on the cluster, its stability, etc. But in this case, individual 
tasks can take 30-60 minutes, if not much more. Any failure during this long 
time is pretty expensive.


Shay

From: Tom Graves 
Sent: Thursday, November 3, 2022 7:56 PM
To: Artemis User; user@spark.apache.org; Shay Elbaz
Subject: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of 
executors when using GPUs


Stage level scheduling does not allow you to change configs right now. This is 
something we thought about as a follow-on but have never implemented.  How many 
tasks on the DL stage are you running?  The typical case is to run some ETL with 
lots of tasks, then do a mapPartitions and run your DL stuff. Before that 
mapPartitions you could do a repartition if necessary to get to exactly the 
number of tasks you want (20).  That way, even if maxExecutors=500, you will only 
ever need 20 (or whatever you repartition to), and Spark isn't going to ask for 
more than that.

Tom

On Thursday, November 3, 2022 at 11:10:31 AM CDT, Shay Elbaz 
 wrote:


Thanks again Artemis, I really appreciate it. I have watched the video but did 
not find an answer.

Please bear with me just one more iteration 

Maybe I'll be more specific:
Suppose I start the application with maxExecutors=500, executors.cores=2, 
because that's the amount of resources needed for the ETL part. But for the DL 
part I only need 20 GPUs. The SLS API only allows setting the resources per 
executor/task, so Spark would (try to) allocate up to 500 GPUs, assuming I 
configure the profile with 1 GPU per executor.
So, the question is how do I limit the stage resources to 20 GPUs total?

Thanks again,
Shay


From: Artemis User 
Sent: Thursday, November 3, 2022 5:23 PM

To: user@spark.apache.org 
Subject: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of 
executors when using GPUs


  Shay,  You may find this video helpful (with some API code samples that you 
are looking for).  https://www.youtube.com/watch?v=JNQu-226wUc&t=171s.  The 
issue here isn't how to limit the number of executors but how to request the 
right GPU-enabled executors dynamically.  Executors used in pre-GPU 
stages should be returned to the resource manager when dynamic resource 
allocation is enabled (with the right DRA policies).  Hope this helps.

Unfortunately there aren't many detailed docs on this topic, since GPU 
acceleration is kind of new in Spark (not straightforward like in TF).   I wish 
the Spark doc team could provide more details in the next release...

On 11/3/22 2:37 AM, Shay Elbaz wrote:
Thanks Artemis. We are not using Rapids, but rather using GPUs through the 
Stage Level Scheduling feature with ResourceProfile. In Kubernetes you have to 
turn on shuffle tracking for dynamic allocation, anyhow.
The question is how we can limit the number of executors when building a new 
ResourceProfile, directly (API) or indirectly (some advanced workaround).

Thanks,
Shay



From: Artemis User <arte...@dtechspace.com>
Sent: Thursday, November 3, 2022 1:16 AM
To: user@spark.apache.org
Subject: [EXTERNAL] Re: Stage level scheduling - lower the number of executors 
when using GPUs


  Are you using Rapids for GPU support in Spark?  Couple of options you may 
want to try:

  1.  In addition to turning on dynamic allocation, you may also need to turn on 
the external shuffle service.
  2.  Sounds like you are using Kubernetes.  In that case, you may also need to 
turn on shuffle tracking.
  3.  The "stages" are controlled by the APIs.  The APIs for dynamic resource 
request (change of stage) do exist, but only for RDDs (e.g. TaskResourceRequest 
and ExecutorResourceRequest).

On 11/2/22 11:30 AM, Shay Elbaz wrote:
Hi,

Our typical applications need fewer executors for a GPU stage than for a CPU 
stage. We are using dynamic allocation with stage level scheduling, and Spark 
tries to maximize the number of executors during the GPU stage as well, causing a 
bit of resource chaos in the cluster. This forces us to use a lower value for 
'maxExecutors' in the first place, at the cost of CPU stage performance, 
or to try to solve this at the Kubernetes scheduler level, which is not 
straightforward and doesn't feel like the right way to go.

Is there a way to effectively use fewer executors in Stage Lev