You can try System.gc(), considering that checkpointing is enabled by default in GraphX:
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html

2015-09-15 22:42 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:

> Hi!
> We are executing the PageRank example from the Spark Java examples package
> on a very large input graph. The code is available here
> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java>
> (Spark's GitHub repo).
>
> During the execution, the framework generates a huge amount of intermediate
> data per iteration (i.e. the *contribs* RDD). The intermediate data is
> temporary, but Spark does not clear the intermediate data of previous
> iterations. That is to say, if we are in the middle of the 20th iteration,
> all of the temporary data of all previous iterations (iterations 0 to 19)
> is still kept in the *tmp* directory. As a result, the tmp directory grows
> linearly.
>
> It seems rational to keep the data from only the previous iteration,
> because if the current iteration fails, the job can be continued using the
> intermediate data from the previous iteration. So why does it keep the
> intermediate data for ALL previous iterations?
>
> How can we force Spark to clear this intermediate data *during* the
> execution of the job?
>
> Kind regards,
> Ali Hadian

--
Alexis GILLAIN
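To see why a System.gc() call in the driver can help, note that Spark's ContextCleaner tracks RDDs with weak references: the shuffle files of an intermediate RDD are only deleted after the driver drops its last strong reference *and* a GC cycle notifies the cleaner. The sketch below is not Spark code; it is a minimal, self-contained Java illustration (all names are hypothetical) of that weak-reference mechanism, and of why nothing gets cleaned until a collection actually runs:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class CleanerSketch {
    public static void main(String[] args) throws InterruptedException {
        // The queue plays the role of Spark's ContextCleaner notification channel.
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Object intermediate = new Object(); // stands in for one iteration's contribs RDD
        WeakReference<Object> ref = new WeakReference<>(intermediate, queue);

        intermediate = null; // the driver loop drops its last strong reference

        // Without a GC cycle, nothing is enqueued -- exactly the reported symptom:
        // old shuffle files pile up because the cleaner is never triggered.
        Reference<?> cleaned = null;
        for (int i = 0; i < 50 && cleaned == null; i++) {
            System.gc();                  // hint a collection, as suggested above
            cleaned = queue.remove(100);  // here Spark would delete the shuffle files
        }
        System.out.println(cleaned == ref ? "cleaned" : "still pending");
    }
}
```

In the real job, the same effect comes from making sure the driver no longer references the previous iteration's RDDs (or unpersisting them explicitly) and then giving the JVM a chance to collect them.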