I'm benchmarking a Spark application by running it for multiple iterations.
It's a shuffle-heavy benchmark, and I run it on a local machine with a very
large heap (~200GB). The system has an SSD. After 3 to 4 iterations I run
out of disk space in the /tmp directory. On further investigation I found
that the shuffle files are still around: because the heap is so large, GC
has not happened, and hence the shuffle files are not deleted. I confirmed
this by lowering the heap size; GC then kicks in more often and the size of
/tmp stays under control. Is there any way I could configure Spark to
handle this issue?

One option I have is to make GC run more often by setting
spark.cleaner.periodicGC.interval to a much lower value. Is there a
cleaner solution?
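
For reference, this is roughly how I would set that option when building the
session. It is only a sketch: the object name, the app name, the local[*]
master, and the "1min" value are placeholders I picked for illustration, not
tuned settings. As far as I know the default interval is 30min.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical benchmark driver; names and values are placeholders.
object ShuffleBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: single local machine
      .appName("shuffle-benchmark")  // hypothetical app name
      // Ask the driver to trigger a GC more often, so that unreferenced
      // shuffle state is noticed and its files can be cleaned up sooner.
      // "1min" is an illustrative value only.
      .config("spark.cleaner.periodicGC.interval", "1min")
      .getOrCreate()

    // ... run the benchmark iterations here ...

    spark.stop()
  }
}
```

The same key could also be passed on the command line with
--conf spark.cleaner.periodicGC.interval=1min instead of being set in code.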


