Hi, I'm benchmarking a Spark application by running it for multiple iterations. It's a shuffle-heavy benchmark, and I run it on a local machine with a very large heap (~200GB). The system has an SSD. After 3 to 4 iterations I run out of disk space in the /tmp directory. On further investigation I found that the shuffle files are still around: because the heap is so large, GC has not happened, and hence the shuffle files have not been deleted. I confirmed this by lowering the heap size, after which GC kicks in more often and the size of /tmp stays under control. Is there any way I could configure Spark to handle this issue?
One option I have is to make GC run more often by setting spark.cleaner.periodicGC.interval to a much lower value. Is there a cleaner solution?

Regards,
Keith.
http://keith-chapman.com
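
P.S. For concreteness, this is roughly how I would set that interval when building the context. It is only a sketch: the application name, the object name, and the "1min" value are placeholders I picked for illustration, not values I have tuned, and the benchmark body is left out.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver for the benchmark; only the configuration matters here.
object ShuffleBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-benchmark") // placeholder name
      .setMaster("local[*]")           // local mode, as in my setup
      // Ask the context cleaner to trigger a GC more often than the default
      // (30min), so unreferenced shuffle data becomes eligible for cleanup
      // sooner. "1min" is an assumed value, not a recommendation.
      .set("spark.cleaner.periodicGC.interval", "1min")

    val sc = new SparkContext(conf)
    try {
      // ... run the benchmark iterations here ...
    } finally {
      sc.stop()
    }
  }
}
```

The same property could also be passed on the command line with --conf spark.cleaner.periodicGC.interval=1min instead of being set in code.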