You can try System.gc(), considering that checkpointing is enabled by default in GraphX:
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html

2015-09-15 22:42 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:

> Hi!
> We are executing the PageRank example from the Spark Java examples package
> on a very large input graph. The code is available here
> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java>
> (Spark's GitHub repo).
>
> During the execution, the framework generates a huge amount of intermediate
> data per iteration (i.e. the *contribs* RDD). The intermediate data is
> temporary, but Spark does not clear the intermediate data of previous
> iterations. That is to say, if we are in the middle of the 20th iteration,
> all of the temporary data of all previous iterations (iterations 0 to 19)
> is still kept in the *tmp* directory. As a result, the tmp directory grows
> linearly.
>
> It seems rational to keep the data from only the previous iteration,
> because if the current iteration fails, the job can be continued using the
> intermediate data from the previous iteration. So why does it keep the
> intermediate data for ALL previous iterations?
>
> How can we force Spark to clear this intermediate data *during* the
> execution of the job?
>
> Kind regards,
> Ali Hadian

--
Alexis GILLAIN
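To see why a System.gc() call in the driver can help, note that Spark's ContextCleaner tracks RDDs with weak references: the shuffle files of an intermediate RDD are only deleted after the driver drops its last strong reference *and* a GC cycle notifies the cleaner. The sketch below is not Spark code; it is a minimal, self-contained Java illustration (all names are hypothetical) of that weak-reference mechanism, and of why nothing gets cleaned until a collection actually runs:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class CleanerSketch {
    public static void main(String[] args) throws InterruptedException {
        // The queue plays the role of Spark's ContextCleaner notification channel.
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Object intermediate = new Object(); // stands in for one iteration's contribs RDD
        WeakReference<Object> ref = new WeakReference<>(intermediate, queue);

        intermediate = null; // the driver loop drops its last strong reference

        // Without a GC cycle, nothing is enqueued -- exactly the reported symptom:
        // old shuffle files pile up because the cleaner is never triggered.
        Reference<?> cleaned = null;
        for (int i = 0; i < 50 && cleaned == null; i++) {
            System.gc();                  // hint a collection, as suggested above
            cleaned = queue.remove(100);  // here Spark would delete the shuffle files
        }
        System.out.println(cleaned == ref ? "cleaned" : "still pending");
    }
}
```

In the real job, the same effect comes from making sure the driver no longer references the previous iteration's RDDs (or unpersisting them explicitly) and then giving the JVM a chance to collect them.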