> Wrt looping: if I want to process 3 years of data, my modest cluster will never do it in one go, I would expect?
> I have to break it down into smaller pieces and run that in a loop (1 day is already lots of data).

Well, that is exactly what Spark is made for. It splits the work up and processes it in small pieces called partitions. No matter how much data you have, it will probably work even on your laptop (as long as the data fits on disk), though it will take some time. But it will succeed. A large cluster does nothing different, except that more partitions are processed in parallel.
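As a rough sketch (the path and column names below are made up, not from your job), letting Spark chew through the whole range in one go could look something like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("three-years-in-one-go").getOrCreate()

# Read all three years at once; Spark splits the input into partitions
# and only processes as many of them at a time as the cluster has cores.
df = spark.read.parquet("/data/events")  # hypothetical input path

result = (df
          .where(F.col("event_date").between("2019-01-01", "2021-12-31"))
          .groupBy("event_date", "customer_id")   # hypothetical columns
          .agg(F.count("*").alias("event_count")))

# Write out partitioned by date so downstream reads stay cheap.
result.write.mode("overwrite").partitionBy("event_date").parquet("/data/daily_counts")

spark.stop()

Whether that covers one day or three years only changes how many partitions there are and how long it runs, not whether it fits.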

You should expect it to work, no matter how many years of data you have. If it does not, rethink your Spark code, not your cluster size.

Share some code that does not work with 3 years of data and people might be able to help. Without that, speculation is all you will get.

Enrico



On 30.03.22 at 17:40, Joris Billen wrote:
Thanks for the answer - much appreciated! This forum is very useful :-)

I didn't know the SparkContext stays alive. I guess this is eating up memory. The eviction means that Spark knows it should clear some of the old cached data to make room for new data. In case anyone has good articles about memory leaks, I would be interested to read them. I will try to add the following lines at the end of my job (as I cached the table in Spark SQL):


sqlContext.sql("UNCACHE TABLE mytableofinterest")
spark.stop()
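
If the days run in a loop inside one application, a minimal sketch of releasing the cache at the end of every iteration instead of only at the very end could look like this (process_one_day and the dates are placeholders, not my real logic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-loop").getOrCreate()

def process_one_day(day):
    # Placeholder for the real per-day logic; it caches a table and works with it.
    spark.range(10).createOrReplaceTempView("mytableofinterest")
    spark.sql("CACHE TABLE mytableofinterest")
    spark.table("mytableofinterest").count()  # some work against the cached table

for day in ["2022-01-01", "2022-01-02"]:      # illustrative dates
    process_one_day(day)
    spark.sql("UNCACHE TABLE mytableofinterest")  # release this day's cached table
    spark.catalog.clearCache()                    # and anything else still cached

spark.stop()  # only once, after the whole loop

That way nothing cached from one day should survive into the next.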


Wrt looping: if I want to process 3 years of data, my modest cluster will never do it in one go, I would expect? I have to break it down into smaller pieces and run that in a loop (1 day is already lots of data).



Thanks!




On 30 Mar 2022, at 17:25, Sean Owen <sro...@gmail.com> wrote:

The Spark context does not stop when a job does. It stops when you stop it. There could be many ways memory can leak. Caching, maybe - but it will evict. You should be clearing caches when they are no longer needed.
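
For example, the basic pattern (the DataFrame here is just a stand-in) is cache, reuse, then unpersist as soon as you are done with it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-lifecycle").getOrCreate()

df = spark.range(1000000)  # stand-in for one day's data
df.cache()                 # keep it in memory while it is being reused

print(df.count())                       # first action materializes the cache
print(df.filter("id % 2 = 0").count())  # reuses the cached blocks

df.unpersist()             # release the cached blocks as soon as reuse is over
spark.stop()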

I would guess it is something else your program holds on to in its logic.

Also consider not looping; there is probably a faster way to do it in one go.

On Wed, Mar 30, 2022, 10:16 AM Joris Billen <joris.bil...@bigindustries.be> wrote:

    Hi,
    I have a pyspark job submitted through spark-submit that does
    some heavy processing for 1 day of data. It runs with no errors.
    I have to loop over many days, so I run this spark job in a loop.
    I notice that after a couple of executions the memory usage on
    all worker nodes keeps increasing, and eventually this leads to
    failures. My job does some caching, but I understand that when
    the job ends successfully, the SparkContext is destroyed and the
    cache should be cleared. However, it seems that something keeps
    filling the memory a bit more after each run. This is the memory
    behaviour over time, which in the end starts leading to failures:

    (What we see is: green = physical memory used, green-blue =
    physical memory cached, grey = memory capacity, the straight
    line around 31 GB.)
    This runs on a healthy Spark 2.4 cluster and was already optimized
    to give a stable job in terms of spark-submit resource parameters
    (driver-memory, num-executors, executor-memory, executor-cores,
    spark.locality.wait).
    Any clue how to “really” clear the memory in between jobs?
    Currently I can loop 10 times and then need to restart my
    cluster so that all memory is cleared completely.


    Thanks for any info!

<Screenshot 2022-03-30 at 15.28.24.png>
