We implemented it this way, using PySpark in standalone mode, and we collect the new RDD2 in each iteration. The Java heap memory consumed by the driver program increases gradually, and the job finally crashes with an OutOfMemoryError.
We have run some tests: in each iteration we simply collect a small vector. Even this trivial program consumed more and more Java heap memory and eventually raised an OutOfMemoryError. We don't know where the memory growth comes from. Is it the accumulated DAG information, or some variable related to the collect function?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-join-this-two-complicated-rdds-tp1665p1749.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
