> Wrt looping: if I want to process 3 years of data, my modest cluster will never do it in one go, I would expect?
> I have to break it down into smaller pieces and run that in a loop (1 day is already lots of data).

Well, that is exactly what Spark is made for. It splits the work up and processes it in small pieces called partitions. No matter how much data you have, it will probably work even on your laptop (as long as the data fits on disk), though it will take some time. But it will succeed. A large cluster does nothing different, except that more partitions are processed in parallel.
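As a rough sketch (the path and column names below are made up, not from your job), letting Spark chew through the whole range in one go could look something like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("three-years-in-one-go").getOrCreate()

# Read all three years at once; Spark splits the input into partitions
# and only processes as many of them at a time as the cluster has cores.
df = spark.read.parquet("/data/events")  # hypothetical input path

result = (df
          .where(F.col("event_date").between("2019-01-01", "2021-12-31"))
          .groupBy("event_date", "customer_id")   # hypothetical columns
          .agg(F.count("*").alias("event_count")))

# Write out partitioned by date so downstream reads stay cheap.
result.write.mode("overwrite").partitionBy("event_date").parquet("/data/daily_counts")

spark.stop()

Whether that covers one day or three years only changes how many partitions there are and how long it runs, not whether it fits.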

You should expect it to work, no matter how many years of data you have. If it does not, rethink your Spark code, not your cluster size.

Share some code that does not work with 3 years of data and people might be able to help. Without that, speculation is all you will get.

Enrico



On 30.03.22 at 17:40, Joris Billen wrote:
Thanks for the answer - much appreciated! This forum is very useful :-)

I didn't know the SparkContext stays alive. I guess this is eating up memory. The eviction means that Spark knows it should clear some of the old cached data to make room for new data. In case anyone has good articles about memory leaks, I would be interested to read them. I will try to add the following lines at the end of my job (as I cached the table in Spark SQL):


sqlContext.sql("UNCACHE TABLE mytableofinterest")
spark.stop()
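
If the days run in a loop inside one application, a minimal sketch of releasing the cache at the end of every iteration instead of only at the very end could look like this (process_one_day and the dates are placeholders, not my real logic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-loop").getOrCreate()

def process_one_day(day):
    # Placeholder for the real per-day logic; it caches a table and works with it.
    spark.range(10).createOrReplaceTempView("mytableofinterest")
    spark.sql("CACHE TABLE mytableofinterest")
    spark.table("mytableofinterest").count()  # some work against the cached table

for day in ["2022-01-01", "2022-01-02"]:      # illustrative dates
    process_one_day(day)
    spark.sql("UNCACHE TABLE mytableofinterest")  # release this day's cached table
    spark.catalog.clearCache()                    # and anything else still cached

spark.stop()  # only once, after the whole loop

That way nothing cached from one day should survive into the next.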


Wrt looping: if I want to process 3 years of data, my modest cluster will never do it in one go, I would expect? I have to break it down into smaller pieces and run that in a loop (1 day is already lots of data).



Thanks!




On 30 Mar 2022, at 17:25, Sean Owen <sro...@gmail.com> wrote:

The Spark context does not stop when a job does. It stops when you stop it. There could be many ways memory can leak. Caching, maybe - but it will evict. You should be clearing caches when they are no longer needed.
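
For example, the basic pattern (the DataFrame here is just a stand-in) is cache, reuse, then unpersist as soon as you are done with it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-lifecycle").getOrCreate()

df = spark.range(1000000)  # stand-in for one day's data
df.cache()                 # keep it in memory while it is being reused

print(df.count())                       # first action materializes the cache
print(df.filter("id % 2 = 0").count())  # reuses the cached blocks

df.unpersist()             # release the cached blocks as soon as reuse is over
spark.stop()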

I would guess it is something else your program holds on to in its logic.

Also consider not looping; there is probably a faster way to do it in one go.

On Wed, Mar 30, 2022, 10:16 AM Joris Billen <joris.bil...@bigindustries.be> wrote:

    Hi,
    I have a pyspark job submitted through spark-submit that does
    some heavy processing for 1 day of data. It runs with no errors.
    I have to loop over many days, so I run this spark job in a loop.
    I notice that after a couple of executions the memory usage on
    all worker nodes keeps increasing, and eventually this leads to
    failures. My job does some caching, but I understand that when
    the job ends successfully, the SparkContext is destroyed and the
    cache should be cleared. However, it seems that something keeps
    filling the memory a bit more after each run. This is the memory
    behaviour over time, which in the end starts leading to failures:

    (What we see is: green = physical memory used, green-blue =
    physical memory cached, grey = memory capacity, the straight
    line around 31 GB.)
    This runs on a healthy Spark 2.4 cluster and was already optimized
    to give a stable job in terms of spark-submit resource parameters
    (driver-memory, num-executors, executor-memory, executor-cores,
    spark.locality.wait).
    Any clue how to “really” clear the memory in between jobs?
    Currently I can loop 10 times and then need to restart my
    cluster so that all memory is cleared completely.


    Thanks for any info!

<Screenshot 2022-03-30 at 15.28.24.png>
