Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-17 Thread Ali Hadian
Thanks, but as far as I know, checkpointing is specific to streaming RDDs and is
not implemented for regular RDDs (it is just inherited from the superclass, but not
implemented).

How can I checkpoint the intermediate JavaRDDs?

-Original Message-
From: Alexis Gillain 
To: Ali Hadian 
Cc: spark users 
Date: Thu, 17 Sep 2015 02:03:46 +0800
Subject: Re: Spark wastes a lot of space (tmp data) for iterative jobs

OK, just realized you don't use the MLlib PageRank.

You must use checkpointing, as pointed out in the Databricks link. MLlib's PeriodicGraphCheckpointer shows the pattern:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala
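For what it's worth, checkpoint() is defined on ordinary RDDs too, not only in streaming. A minimal sketch of the call pattern on a plain JavaRDD (the checkpoint directory, master URL, and data are illustrative; on a real cluster the directory should live on HDFS):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckpointSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("checkpoint-sketch")
                .setMaster("local[*]");          // illustrative; use your real master
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Checkpoint files must go to reliable storage (HDFS on a cluster).
        sc.setCheckpointDir("/tmp/spark-checkpoints");

        JavaRDD<Integer> ranks = sc.parallelize(Arrays.asList(1, 2, 3, 4));
        ranks.checkpoint();   // only marks the RDD; nothing is written yet
        ranks.count();        // the next action materializes the checkpoint
                              // and truncates the lineage above `ranks`
        sc.stop();
    }
}
```

Checkpointing every iteration is wasteful; checkpointing every N iterations (which is what PeriodicGraphCheckpointer does) is the usual compromise.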

Due to lineage, Spark doesn't erase the shuffle files.
When you do:

contrib = link.join(rank)
rank = contrib.map(...)
contrib = link.join(rank)

I think Spark doesn't erase the shuffle files of the first join because they
are still part of the lineage of the second contrib through rank.
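The growth can be seen with a toy model of lineage, where each transformation creates a node pointing at its parents (this only illustrates the bookkeeping, it is not Spark code): without checkpointing the chain grows by two nodes per iteration, and a checkpoint cuts it so everything above can be released.

```java
import java.util.ArrayList;
import java.util.List;

public class LineageSketch {
    static final class Node {
        final String op;
        final List<Node> parents = new ArrayList<>();
        Node(String op, Node... ps) { this.op = op; for (Node p : ps) parents.add(p); }
        // Like RDD.checkpoint(): forget the parents, truncating the lineage.
        void checkpoint() { parents.clear(); }
        int depth() {
            int d = 0;
            for (Node p : parents) d = Math.max(d, p.depth());
            return d + 1;
        }
    }

    // Replay the PageRank loop from the thread: contrib = link.join(rank);
    // rank = contrib.map(...), optionally checkpointing rank every k iterations.
    static int depthAfter(int iterations, int checkpointEvery) {
        Node link = new Node("links");
        Node rank = new Node("ranks");
        for (int i = 1; i <= iterations; i++) {
            Node contrib = new Node("join", link, rank);
            rank = new Node("map", contrib);
            if (checkpointEvery > 0 && i % checkpointEvery == 0) rank.checkpoint();
        }
        return rank.depth();
    }

    public static void main(String[] args) {
        System.out.println(depthAfter(20, 0));  // 41: lineage grows linearly (2i+1)
        System.out.println(depthAfter(20, 5));  // 1: bounded by periodic checkpoints
    }
}
```

Every node still reachable through the chain keeps its shuffle output alive, which is why the tmp directory grows linearly until the lineage is truncated.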

Have a look at this : https://www.youtube.com/watch?v=1MWxIUoIYFA

2015-09-16 22:16 GMT+08:00 Ali Hadian :
Thanks for your response, Alexis.

I have seen this page, but its suggested solutions do not work: the tmp
space still grows linearly after unpersisting RDDs and calling System.gc()
in each iteration.

I think it might be due to one of the following reasons:

1. System.gc() does not directly invoke the garbage collector; it just
requests that the JVM run GC, and the JVM usually postpones it until memory is
almost full. However, since we are only running out of hard-disk space (not
memory), GC does not run, and therefore the finalize() methods of the
intermediate RDDs are never triggered.


2. System.gc() is only executed on the driver, not on the workers (is that
how it works?).
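On point 2: a System.gc() call in the driver program can only hint the driver's own JVM. To request a GC on the executors as well, code has to run there, for example one task per partition. A sketch (master URL and data illustrative; like any System.gc() call, the executor JVMs are free to ignore the hint):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExecutorGcSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("executor-gc-sketch").setMaster("local[*]"));
        JavaRDD<Integer> probe = sc.parallelize(Arrays.asList(1, 2, 3, 4), 4);

        // One task per partition; each task runs inside an executor JVM
        // and asks that JVM for a GC.
        probe.foreachPartition(it -> System.gc());

        sc.stop();
    }
}
```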

Any suggestions?

Kind regards
Ali Hadian

-Original Message-
From: Alexis Gillain 
To: Ali Hadian 
Cc: spark users 
Date: Wed, 16 Sep 2015 12:05:35 +0800
Subject: Re: Spark wastes a lot of space (tmp data) for iterative jobs

You can try System.gc(), considering that checkpointing is enabled by default
in GraphX:

https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
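The linked answer amounts to giving Spark more scratch space and letting it clean up periodically. A sketch of the spark-defaults.conf entries from the Spark 1.x era (values illustrative; spark.cleaner.ttl can delete data that a long lineage still needs, so use it with care):

```
# spark-defaults.conf (sketch)
spark.local.dir    /mnt/big-disk/spark-tmp   # scratch directory for shuffle files
spark.cleaner.ttl  3600                      # periodically drop metadata/shuffle data older than 1h
```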

2015-09-15 22:42 GMT+08:00 Ali Hadian < had...@comp.iust.ac.ir>:
Hi!
We are executing the PageRank example from the Spark Java examples package
on a very large input graph. The code is available here:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
(Spark's GitHub repo).
During the execution, the framework generates a huge amount of intermediate
data per iteration (i.e., the contribs RDD). The intermediate data is
temporary, but Spark does not clear the intermediate data of previous
iterations. That is to say, if we are in the middle of the 20th iteration, the
temporary data of all previous iterations (iterations 0 to 19) is still kept
in the tmp directory. As a result, the tmp directory grows linearly.
It seems rational to keep the data from only the previous iteration, because
if the current iteration fails, the job can be resumed using the
intermediate data from the previous iteration. So why does it keep the
intermediate data for ALL previous iterations?
How can we force Spark to clear this intermediate data during the
execution of the job?

Kind regards, 
Ali hadian




--
Alexis GILLAIN



--
Alexis GILLAIN
