I am writing something that partitions a data set and then trains a machine learning model on the data in each partition.
The resulting model is very big. Right now I am storing it in an RDD as a pair of (partition_id, very_big_model_that_is_hundreds_of_megabytes_big), but it is becoming increasingly apparent that storing data that large in a single row of an RDD causes all sorts of complications.

So I figured that instead I could save each model to the filesystem and store a pointer to it (the file path) in the RDD. Then I would simply load the model again in a mapPartitions function and avoid the issue. But that raises the question of when to clean up these temporary files.

Is there some way to ensure that files written by Spark code get cleaned up when the SparkSession ends, or when the RDD is no longer referenced? Or is there some other solution to this problem?
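To make the idea concrete, here is a minimal sketch of the "store a path, not the model" pattern I have in mind. It is plain Python (pickle + a temp directory) rather than real PySpark so it is self-contained; the helper names (save_model, load_model) and the atexit-based cleanup are my own assumptions, standing in for "clean up when the SparkSession ends". On a real cluster the scratch directory would have to live on shared storage (HDFS/S3) so executors can read it.

```python
import atexit
import os
import pickle
import shutil
import tempfile

# Hypothetical helpers: each partition's trained model is pickled to a
# scratch directory, and the RDD would keep only the returned path.
SCRATCH_DIR = tempfile.mkdtemp(prefix="partition_models_")

def save_model(partition_id, model):
    """Pickle the model to the scratch dir and return its file path."""
    path = os.path.join(SCRATCH_DIR, "model_%d.pkl" % partition_id)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_model(path):
    """Reload a model from its path, e.g. inside a mapPartitions function."""
    with open(path, "rb") as f:
        return pickle.load(f)

def cleanup_models():
    """Remove the whole scratch directory."""
    shutil.rmtree(SCRATCH_DIR, ignore_errors=True)

# Approximates "delete when the session ends" by tying cleanup to
# driver-process exit; this is the part I am unsure how to do properly
# in Spark itself (e.g. tied to the RDD's lifetime instead).
atexit.register(cleanup_models)
```

This handles the happy path, but atexit only fires on normal interpreter exit, which is exactly why I am asking whether Spark offers a hook tied to the SparkSession or to RDD garbage collection instead.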