Thanks for your response, Alexis. I have seen that page, but its suggested solutions do not work: the tmp space still grows linearly even though we unpersist the RDDs and call System.gc() in each iteration.
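For concreteness, the cleanup we run at the end of every iteration looks roughly like this (a minimal sketch against the Java API; prevRanks and contribs are illustrative names based on the PageRank example, not our exact code):

import org.apache.spark.api.java.JavaPairRDD;

// End of one PageRank iteration: drop the previous iteration's RDDs
// and hint the JVM to collect, hoping Spark's ContextCleaner then
// removes the corresponding temporary files on disk.
static void endOfIteration(JavaPairRDD<String, Double> prevRanks,
                           JavaPairRDD<String, Double> contribs) {
  prevRanks.unpersist(true);  // blocking unpersist of the old ranks
  contribs.unpersist(true);   // drop the intermediate contributions
  System.gc();                // merely a request; the JVM may ignore it
}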
I think it might be due to one of the following reasons:

1. System.gc() does not directly invoke the garbage collector; it merely requests that the JVM run a GC, and the JVM usually postpones it until memory is almost full. Since we are running out of hard-disk space, not memory, the GC never runs, and therefore the finalize() methods of the intermediate RDDs are never triggered.

2. System.gc() is only executed on the driver, but not on the workers. (Is that how it works?) A sketch of the executor-side workaround I have in mind is at the very bottom of this mail.

Any suggestions?

Kind regards,
Ali Hadian

-----Original Message-----
From: Alexis Gillain <alexis.gill...@googlemail.com>
To: Ali Hadian <had...@comp.iust.ac.ir>
Cc: spark users <user@spark.apache.org>
Date: Wed, 16 Sep 2015 12:05:35 +0800
Subject: Re: Spark wastes a lot of space (tmp data) for iterative jobs

You can try System.gc(), considering that checkpointing is enabled by default in GraphX:
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html

2015-09-15 22:42 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:

Hi!

We are executing the PageRank example from Spark's Java examples package on a very large input graph. The code is available here (Spark's GitHub repo).

During execution, the framework generates a huge amount of intermediate data in each iteration (i.e. the contribs RDD). This intermediate data is temporary, but Spark does not clear the intermediate data of previous iterations. That is to say, if we are in the middle of the 20th iteration, the temporary data of all previous iterations (iterations 0 to 19) is still kept in the tmp directory, so the tmp directory grows linearly.

It would seem rational to keep the data from only the previous iteration, because if the current iteration fails, the job can be resumed using the intermediate data from the previous iteration. So why does Spark keep the intermediate data of ALL previous iterations? How can we force Spark to clear this intermediate data during the execution of the job?

Kind regards,
Ali Hadian

--
Alexis GILLAIN
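P.S. Regarding reason 2 above: the workaround I have in mind is to force a GC on the executors by running a dummy job over the cluster. This is a speculative, untested sketch; numExecutors is an assumption you would set for your own cluster, and Spark gives no guarantee that these tasks actually land on every executor.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical helper: request a GC on the executors, not just the driver.
static void gcOnExecutors(JavaSparkContext sc, int numExecutors) {
  List<Integer> dummy = new ArrayList<>();
  for (int i = 0; i < numExecutors * 4; i++) {
    dummy.add(i);  // several trivial tasks per executor, best effort
  }
  // One System.gc() per partition; with more partitions than executors,
  // each executor is likely (but not certain) to run at least one task.
  sc.parallelize(dummy, numExecutors * 4)
    .foreachPartition(iter -> System.gc());
}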