> Wrt looping: if I want to process 3 years of data, my modest cluster
> will never do it in one go, I would expect?
> I have to break it down into smaller pieces and run that in a loop (1
> day is already lots of data).
Well, that is exactly what Spark is made for. It splits the work up and
processes it in small pieces, called partitions. No matter how much data
you have, it will probably even work on your laptop (as long as the data
fits on disk), though it will take some time. But it will succeed. A large
cluster does nothing different, except that more partitions are processed
in parallel.
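For illustration, a minimal sketch of what "one go" could look like. The paths and the column name (/data/events, event_date, /data/out) are hypothetical placeholders, and the groupBy stands in for your real per-day logic:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("three-year-job").getOrCreate()

    # Read the whole 3-year range at once; Spark splits it into partitions itself.
    df = (spark.read.parquet("/data/events")  # hypothetical input path
            .where(F.col("event_date").between("2019-01-01", "2021-12-31")))

    # Apply the per-day logic once to the full range (placeholder aggregation).
    result = df.groupBy("event_date").count()

    result.write.mode("overwrite").parquet("/data/out")  # hypothetical output path
    spark.stop()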
You should expect it to work, no matter how many years of data.
Otherwise, you have to rethink your Spark code, not your cluster size.
Share some code that does not work with 3 years and people might help.
Without that, speculation is all you will get.
Enrico
On 30.03.22 at 17:40, Joris Billen wrote:
Thanks for the answer, much appreciated! This forum is very useful :-)
I didn't know the SparkContext stays alive. I guess this is eating up
memory. The eviction means that it knows it should clear some of the
old cached data to be able to store new data. In case anyone has good
articles about memory leaks, I would be interested to read them.
I will try to add the following lines at the end of my job (as I cached
the table in Spark SQL):
sqlContext.sql("UNCACHE TABLE mytableofinterest")
spark.stop()
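If the loop over days runs inside a single driver script, a minimal sketch of where those calls could go (days_to_process and process_one_day are hypothetical placeholders for your loop and per-day logic):

    # Assuming `spark` is the active SparkSession driving the whole loop.
    days_to_process = ["2021-01-01", "2021-01-02"]  # hypothetical list of dates

    for day in days_to_process:
        process_one_day(spark, day)  # hypothetical function with the per-day logic

        # Drop the cached table explicitly ...
        spark.sql("UNCACHE TABLE mytableofinterest")
        # ... or clear every cached table/DataFrame in one call.
        spark.catalog.clearCache()

    # Stop the context once, after all days are done, not inside the loop.
    spark.stop()

If each day is instead submitted as a separate spark-submit job, calling spark.stop() at the end of every job is fine.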
Wrt looping: if I want to process 3 years of data, my modest cluster
will never do it in one go, I would expect? I have to break it down into
smaller pieces and run that in a loop (1 day is already lots of data).
Thanks!
On 30 Mar 2022, at 17:25, Sean Owen <sro...@gmail.com> wrote:
The Spark context does not stop when a job does. It stops when you
stop it. There could be many ways memory can leak. Caching, maybe - but
it will evict. You should be clearing caches when they are no longer
needed. I would guess it is something else your program holds on to in
its logic.
Also consider not looping; there is probably a faster way to do it in
one go.
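To make "clearing caches when no longer needed" concrete, a small sketch of explicit cache management on a DataFrame (it assumes an active SparkSession named spark; the path and column name are made up):

    # Cache an intermediate result while it is reused, then release it.
    df = spark.read.parquet("/data/some_input")  # hypothetical path
    df.cache()

    total = df.count()
    per_key = df.groupBy("key").count().collect()  # hypothetical column

    df.unpersist()  # frees the cached blocks on the executors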
On Wed, Mar 30, 2022, 10:16 AM Joris Billen
<joris.bil...@bigindustries.be> wrote:
Hi,
I have a pyspark job submitted through spark-submit that does
some heavy processing for 1 day of data. It runs with no errors.
I have to loop over many days, so I run this spark job in a loop.
I notice that after a couple of executions the memory is increasing
on all worker nodes and eventually this leads to failures. My job
does some caching, but I understand that when the job ends
successfully, the SparkContext is destroyed and the cache
should be cleared. However, it seems that something keeps
filling the memory a bit more after each run. This is
the memory behaviour over time, which in the end will start
leading to failures:
(what we see is: green = physical memory used, green-blue = physical
memory cached, grey = memory capacity = straight line around 31 GB)
This runs on a healthy Spark 2.4 and was already optimized to a
stable job in terms of spark-submit resource parameters like
driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait.
Any clue how to “really” clear the memory in between jobs? Basically,
currently I can loop 10x and then I need to restart my cluster so that
all memory is cleared completely.
Thanks for any info!
<Screenshot 2022-03-30 at 15.28.24.png>