Thanks for your response, Alexis. 

I have seen this page, but its suggested solutions do not work: the tmp 
space still grows linearly even after unpersisting the RDDs and calling 
System.gc() in each iteration.
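
For reference, this is roughly what each iteration now looks like (a 
simplified Scala sketch, not our actual code; initialRanks, numIterations 
and computeNextRanks are placeholders):

    import org.apache.spark.rdd.RDD

    var ranks: RDD[(String, Double)] = initialRanks.cache()

    for (i <- 1 to numIterations) {
      val newRanks = computeNextRanks(ranks).cache()
      newRanks.count()                   // materialize before dropping the old iteration
      ranks.unpersist(blocking = true)   // drop the previous iteration's cached blocks
      System.gc()                        // only a *request* to the driver JVM
      ranks = newRanks
    }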

I think it might be due to one of the following reasons:

1. System.gc() does not directly invoke the garbage collector; it only 
requests that the JVM run a GC, and the JVM usually postpones it until 
memory is almost full. However, since we are running out of hard-disk space 
(not memory), the GC never runs; therefore the finalize() methods for the 
intermediate RDDs are never triggered.

2. System.gc() is only executed on the driver, not on the workers (is that 
how it works?).
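
If reason 2 is what is happening, one experiment I could try is to request a 
GC on the executors as well, by running a dummy job that calls System.gc() 
inside each partition (just an idea for a test; numExecutorSlots is a 
placeholder for the total number of executor cores):

    // Request a GC on the executor JVMs: one dummy task per slot.
    // numExecutorSlots is a placeholder, not a Spark setting.
    sc.parallelize(1 to numExecutorSlots, numExecutorSlots)
      .foreachPartition(_ => System.gc())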

Any suggestions?

Kind regards
Ali Hadian


-----Original Message-----
From: Alexis Gillain <alexis.gill...@googlemail.com>
To: Ali Hadian <had...@comp.iust.ac.ir>
Cc: spark users <user@spark.apache.org>
Date: Wed, 16 Sep 2015 12:05:35 +0800
Subject: Re: Spark wastes a lot of space (tmp data) for iterative jobs

You can try System.gc(), considering that checkpointing is enabled by 
default in GraphX:

https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
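
Roughly, the pattern is to checkpoint the iteration result every few 
iterations so that the lineage can be truncated, and then request a GC so 
the dropped data can actually be cleaned up (a sketch only, with placeholder 
names; this is not the exact code from that page):

    // ranks, initialRanks, computeNextRanks and numIterations stand in for the
    // RDDs/loop of your job; the path and the interval of 10 are just examples.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    var ranks = initialRanks
    for (i <- 1 to numIterations) {
      ranks = computeNextRanks(ranks).cache()
      if (i % 10 == 0) ranks.checkpoint()  // next action writes it out and cuts the lineage
      ranks.count()                        // materialize (and checkpoint, when scheduled)
      System.gc()
    }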

2015-09-15 22:42 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:
Hi!
We are executing the PageRank example from the Spark Java examples package 
on a very large input graph. The code is available in Spark's GitHub repo.
During execution, the framework generates a huge amount of intermediate data 
in each iteration (i.e. the contribs RDD). This intermediate data is 
temporary, but Spark does not clear the intermediate data of previous 
iterations. That is to say, if we are in the middle of the 20th iteration, 
the temporary data of all previous iterations (iterations 0 to 19) is still 
kept in the tmp directory. As a result, the tmp directory grows linearly.
It seems rational to keep the data from only the previous iteration, because 
if the current iteration fails, the job can be resumed from the intermediate 
data of the previous iteration. Anyway, why does it keep the intermediate 
data of ALL previous iterations?
How can we force Spark to clear this intermediate data during the execution 
of the job?
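
For concreteness, the loop in question looks roughly like this (a Scala 
paraphrase of the example from memory; the Java version we run has the same 
structure, with lines being the input text file and iters the number of 
iterations):

    // Core loop of the PageRank example (paraphrased, not the exact source).
    val links = lines.map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1))
    }.distinct().groupByKey().cache()

    var ranks = links.mapValues(v => 1.0)

    for (i <- 1 to iters) {
      // A new contribs RDD (plus its shuffle output) is created in every iteration,
      // and each new ranks RDD keeps a lineage reference back to all earlier ones.
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }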

Kind regards, 
Ali Hadian




--
Alexis GILLAIN
