Yeah, this is because the broadcast variables are kept in memory, and I am guessing they are referenced in a way that prevents them from being garbage collected... One solution could be to enable spark.cleaner.ttl, but I don't like it much, as it feels more like a hack. There is also a PR that was merged a few days ago, https://github.com/apache/incubator-spark/pull/543, but unfortunately it is not part of Spark 0.9 :(
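If you do want to try spark.cleaner.ttl, here is a minimal sketch of how to enable it (the app name and the TTL value are just placeholders; the property takes a number of seconds, and anything older than that, such as old RDD and shuffle/broadcast metadata, becomes eligible for cleanup, so pick a value comfortably longer than anything your job still needs):

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch: turn on the periodic metadata cleaner.
  // The TTL is in seconds; data older than this may be dropped,
  // so make it longer than a single iteration of your job.
  val conf = new SparkConf()
    .setAppName("iterative-job")          // placeholder name
    .set("spark.cleaner.ttl", "3600")     // e.g. one hour
  val sc = new SparkContext(conf)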
However, you can have a look at the source code and maybe implement something similar in your job, so that at the end of each iteration you remove the broadcast variables; it should be possible through SparkEnv.get.blockManager (see the sketch after the quoted message below).

2014-02-19 12:09 GMT+01:00 hanbo <[email protected]>:

> We have implemented it this way; we use PySpark in standalone mode. We
> collect the new RDD2 in each iteration. The Java heap memory used by the
> driver program increases gradually, and the job finally crashes with an
> OutOfMemory error.
>
> We have done some tests: in each iteration, we simply collect a vector.
> Even this small, simple job used more and more Java heap memory and
> finally raised OutOfMemory.
>
> We don't know where the memory goes. Is it used by the DAG information,
> or by some variable related to the collect function?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-join-this-two-complicated-rdds-tp1665p1749.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
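Here is a rough sketch of what I mean by removing the broadcast blocks through the block manager. It leans on Spark internals (Broadcast.id, BroadcastBlockId, BlockManager.removeBlock), which are not public API and are private[spark] in places, so check the 0.9 sources first; you may need to put this helper in a package under org.apache.spark. The loop at the end is only illustrative; step, initialVector and numIterations are placeholders for your own job.

  import org.apache.spark.SparkEnv
  import org.apache.spark.broadcast.Broadcast
  import org.apache.spark.storage.BroadcastBlockId

  // Rough sketch (internal API, may change between versions):
  // drop the block holding a broadcast's value from the block manager
  // once an iteration no longer needs it, so it can be garbage collected.
  def removeBroadcastBlock(bc: Broadcast[_]): Unit = {
    val blockManager = SparkEnv.get.blockManager
    // The broadcast value is stored under BroadcastBlockId(bc.id);
    // tellMaster = true also clears the master's bookkeeping entry.
    blockManager.removeBlock(BroadcastBlockId(bc.id), tellMaster = true)
  }

  // Illustrative iteration pattern: re-broadcast the new value each
  // iteration and drop the previous one.
  // var bcVector = sc.broadcast(initialVector)
  // for (i <- 1 to numIterations) {
  //   val updated = step(rdd, bcVector)   // your per-iteration work
  //   removeBroadcastBlock(bcVector)
  //   bcVector = sc.broadcast(updated)
  // }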
