Hi,

What do you propose, or what do you think would help, given that these Spark
jobs are independent of each other? Once a job/iteration is complete, there
is no need to retain its shuffle files. You have a number of options to
consider, starting with the Spark shuffle-behaviour configuration parameters:

https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
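As an illustrative sketch only (the values below are placeholders, not
recommendations for your workload), shuffle-behaviour settings from that page
can be supplied when the PySpark session is built:

from pyspark.sql import SparkSession

# Illustrative only: placeholder values, tune for your own workload.
spark = (
    SparkSession.builder
    .appName("shuffle-heavy-loop")
    # Compress shuffle outputs and spills to reduce the on-disk footprint.
    .config("spark.shuffle.compress", "true")
    .config("spark.shuffle.spill.compress", "true")
    # Larger in-memory buffer per shuffle file writer before hitting disk.
    .config("spark.shuffle.file.buffer", "1m")
    .getOrCreate()
)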

As an aside, have you turned on dynamic resource allocation and the relevant
parameters? Can you also increase executor memory and adjust
spark.storage.memoryFraction and spark.shuffle.spillThreshold? You can of
course use brute force with shutil.rmtree(path) to remove these files, as
sketched below.
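A rough, untested sketch of that brute-force cleanup (the appcache path and
the age threshold are assumptions you would need to adapt to your EMR layout;
only remove directories you are sure no running job still needs):

import os
import shutil
import time

# Assumption: YARN local dirs on EMR, as in your [1] path; adjust as needed.
APPCACHE_DIR = "/mnt/yarn/usercache/hadoop/appcache"
MAX_AGE_SECONDS = 6 * 3600  # only touch shuffle dirs untouched for 6+ hours

now = time.time()
for app_dir in os.listdir(APPCACHE_DIR):
    app_path = os.path.join(APPCACHE_DIR, app_dir)
    if not os.path.isdir(app_path):
        continue
    for entry in os.listdir(app_path):
        if not entry.startswith("blockmgr-"):
            continue
        block_path = os.path.join(app_path, entry)
        # Skip anything recently modified, to avoid deleting shuffle
        # files that an in-flight job may still be reading.
        if now - os.path.getmtime(block_path) > MAX_AGE_SECONDS:
            shutil.rmtree(block_path, ignore_errors=True)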

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, one verified and tested result holds more weight
than a thousand expert opinions.


On Sat, 17 Feb 2024 at 23:40, Saha, Daniel <dans...@amazon.com.invalid>
wrote:

> Hi,
>
>
>
> *Background*: I am running into executor disk space issues when running a
> long-lived Spark 3.3 app with YARN on AWS EMR. The app performs
> back-to-back spark jobs in a sequential loop with each iteration performing
> 100gb+ shuffles. The files taking up the space are related to shuffle
> blocks [1]. Disk is only cleared when restarting the YARN app. For all
> intents and purposes, each job is independent. So once a job/iterator is
> complete, there is no need to retain these shuffle files. I want to try
> stopping and recreating the Spark context between loop iterations/jobs to
> indicate to Spark DiskBlockManager that these intermediate results are no
> longer needed [2].
>
>
>
> *Questions*:
>
>    - Are there better ways to remove/clean the directory containing these
>    old, no longer used, shuffle results (aside from cron or restarting yarn
>    app)?
>    - How to recreate the spark context within a single application? I see
>    no methods in Spark Session for doing this, and each new Spark session
>    re-uses the existing spark context. After stopping the SparkContext,
>    SparkSession does not re-create a new one. Further, creating a new
>    SparkSession via constructor and passing in a new SparkContext is not
>    allowed as it is a protected/private method.
>
>
>
> Thanks
>
> Daniel
>
>
>
> [1]
> /mnt/yarn/usercache/hadoop/appcache/application_1706835946137_0110/blockmgr-eda47882-56d6-4248-8e30-a959ddb912c5
>
> [2] https://stackoverflow.com/a/38791921
>
