We implemented it this way, using PySpark in standalone mode, and we collect
the new RDD2 in each iteration. The Java heap memory used by the driver
program increases gradually, and the job finally crashes with an OutOfMemoryError.

We have run some tests: in each iteration we simply collect a small vector.
Even this trivial case consumes more and more Java heap memory and
eventually raises an OutOfMemoryError.

We don't understand where the memory goes. Is it consumed by the accumulated
DAG/lineage information, or by some state associated with the collect() call?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-join-this-two-complicated-rdds-tp1665p1749.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.