Re: [EXTERNAL] Re: Stage level scheduling - lower the number of executors when using GPUs

Artemis User Thu, 03 Nov 2022 08:25:09 -0700

Shay, you may find this video helpful (it includes some of the API code samples that you are looking for): https://www.youtube.com/watch?v=JNQu-226wUc&t=171s The issue here isn't how to limit the number of executors but how to request the right GPU-enabled executors dynamically. The executors used in pre-GPU stages should be returned to the resource manager when dynamic resource allocation (DRA) is enabled with the right DRA policies. Hope this helps.
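
For reference, here is a minimal sketch of the dynamic allocation settings discussed in this thread, written as a small Python helper that produces spark-submit arguments. The helper names (dynamic_allocation_conf, to_submit_args) and the default values are assumptions for illustration only; the configuration keys themselves are standard Spark properties. What counts as the "right" DRA policy depends on your workload.

```python
def dynamic_allocation_conf(max_executors, idle_timeout="60s", on_kubernetes=True):
    """Return Spark properties that enable dynamic allocation.

    Hypothetical helper: the values are placeholders, not recommendations.
    """
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Idle executors are released back to the resource manager after this.
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
    }
    if on_kubernetes:
        # Shuffle tracking is used here instead of an external shuffle service.
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    else:
        conf["spark.shuffle.service.enabled"] = "true"
    return conf


def to_submit_args(conf):
    """Flatten a property dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(dynamic_allocation_conf(max_executors=20))))
```

Note that maxExecutors in this sketch applies to the whole application, which is exactly the limitation Shay describes below.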

Unfortunately, there aren't many detailed docs on this topic, since GPU acceleration is still fairly new in Spark (and not as straightforward as in TF). I wish the Spark doc team would provide more details in the next release...


On 11/3/22 2:37 AM, Shay Elbaz wrote:

Thanks Artemis. We are *not* using Rapids, but rather using GPUs through the Stage Level Scheduling feature with ResourceProfile. In Kubernetes you have to turn on shuffle tracking for dynamic allocation anyhow. The question is how we can limit the *number of executors* when building a new ResourceProfile, directly (API) or indirectly (some advanced workaround).
Thanks,
Shay

------------------------------------------------------------------------
*From:* Artemis User <[email protected]>
*Sent:* Thursday, November 3, 2022 1:16 AM
*To:* [email protected] <[email protected]>
*Subject:* [EXTERNAL] Re: Stage level scheduling - lower the number of executors when using GPUs
Are you using Rapids for GPU support in Spark? A couple of options you may want to try:
 1. In addition to turning on dynamic allocation, you may also need to
    turn on the external shuffle service.
 2. Sounds like you are using Kubernetes.  In that case, you may also
    need to turn on shuffle tracking.
 3. The "stages" are controlled by the APIs.  The APIs for dynamic
    resource request (change of stage) do exist, but only for RDDs
    (e.g. TaskResourceRequest and ExecutorResourceRequest).
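
As a rough illustration of the RDD-level API mentioned in item 3, the sketch below builds a GPU ResourceProfile in PySpark and attaches it to an RDD. The function names (build_gpu_profile, run_gpu_stage) and all sizes are hypothetical placeholders; the classes come from pyspark.resource (Spark 3.1 or later). On Kubernetes you would typically also need to supply a GPU discovery script and vendor, which are omitted here.

```python
# Hypothetical sizes for illustration; tune them for your cluster.
GPU_EXECUTOR_CORES = 4
GPU_EXECUTOR_MEMORY = "16g"
GPUS_PER_EXECUTOR = 1
GPUS_PER_TASK = 1


def build_gpu_profile():
    """Build a ResourceProfile for a GPU stage (requires PySpark 3.1+)."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.resource import (
        ExecutorResourceRequests,
        ResourceProfileBuilder,
        TaskResourceRequests,
    )

    # What each executor started for this profile should provide.
    executor_reqs = (
        ExecutorResourceRequests()
        .cores(GPU_EXECUTOR_CORES)
        .memory(GPU_EXECUTOR_MEMORY)
        .resource("gpu", GPUS_PER_EXECUTOR)
    )
    # What each task in the stage needs.
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", GPUS_PER_TASK)

    # In PySpark, `build` is a property, not a method call.
    return ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build


def run_gpu_stage(rdd, gpu_fn):
    """Run gpu_fn over each partition of rdd using the GPU profile."""
    return rdd.withResources(build_gpu_profile()).mapPartitions(gpu_fn)
```

The requests above describe the shape of each executor and task; none of them sets how many executors are requested for the profile, which is the gap raised in this thread.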


On 11/2/22 11:30 AM, Shay Elbaz wrote:
Hi,
Our typical applications need fewer *executors* for a GPU stage than for a CPU stage. We are using dynamic allocation with stage level scheduling, and Spark tries to maximize the number of executors during the GPU stage as well, causing a bit of resource chaos in the cluster. This forces us to use a lower value for 'maxExecutors' in the first place, at the cost of CPU stage performance, or to try to solve this at the Kubernetes scheduler level, which is not straightforward and doesn't feel like the right way to go.
Is there a way to effectively use fewer executors in Stage Level Scheduling? The API does not seem to include such an option, but maybe there is some more advanced workaround?
Thanks,
Shay
