I am writing something that partitions a data set and then trains a machine learning model on the data in each partition.
The resulting model is very big. Right now I am storing it in an RDD as a pair of (partition_id, very_big_model_that_is_hundreds_of_megabytes_big), but it is becoming increasingly apparent that storing data that large in a single row of an RDD causes all sorts of complications.

So I figured that instead I could save each model to the filesystem and store a pointer to it (the file path) in the RDD. Then I would simply load the model again in a mapPartitions function and avoid the issue. But that raises the question of when to clean up these temporary files.

Is there some way to ensure that files written by Spark code get cleaned up when the SparkSession ends, or when the RDD is no longer referenced? Or is there some other solution to this problem?
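To make the idea concrete, here is a minimal sketch of the "store a path, not the model" pattern I have in mind. It is plain Python (pickle + a temp directory) rather than real PySpark so it is self-contained; the helper names (save_model, load_model) and the atexit-based cleanup are my own assumptions, standing in for "clean up when the SparkSession ends". On a real cluster the scratch directory would have to live on shared storage (HDFS/S3) so executors can read it.

```python
import atexit
import os
import pickle
import shutil
import tempfile

# Hypothetical helpers: each partition's trained model is pickled to a
# scratch directory, and the RDD would keep only the returned path.
SCRATCH_DIR = tempfile.mkdtemp(prefix="partition_models_")

def save_model(partition_id, model):
    """Pickle the model to the scratch dir and return its file path."""
    path = os.path.join(SCRATCH_DIR, "model_%d.pkl" % partition_id)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_model(path):
    """Reload a model from its path, e.g. inside a mapPartitions function."""
    with open(path, "rb") as f:
        return pickle.load(f)

def cleanup_models():
    """Remove the whole scratch directory."""
    shutil.rmtree(SCRATCH_DIR, ignore_errors=True)

# Approximates "delete when the session ends" by tying cleanup to
# driver-process exit; this is the part I am unsure how to do properly
# in Spark itself (e.g. tied to the RDD's lifetime instead).
atexit.register(cleanup_models)
```

This handles the happy path, but atexit only fires on normal interpreter exit, which is exactly why I am asking whether Spark offers a hook tied to the SparkSession or to RDD garbage collection instead.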