Thank you very much for your answer. We have tried the method above before.
These are the problems we ran into while doing so.

1. We want to avoid the collect method because we do this step inside an
iteration, and RDD2 changes in every iteration, so the speed, memory usage,
and network traffic bother us a lot.

2. The keys in RDD1 are not well distributed. For example, key "1" appears
in every line, while key "1765" may occur fewer than 10 times in the whole
of RDD1. This skew makes some workers process far more data and take far
more time. (A minimal sketch of key salting, one standard mitigation,
follows this list.)
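For what it's worth, here is a minimal sketch of key salting, one standard
way to spread a hot key like "1" across workers; the data and the bucket
count n are toy placeholders, not our real job:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SaltedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-join").setMaster("local[*]"))

    // toy stand-ins: "1" is the hot key, "1765" the rare one
    val rdd1 = sc.parallelize(Seq(("1", "a"), ("1", "b"), ("1765", "c")))
    val rdd2 = sc.parallelize(Seq(("1", 100), ("1765", 200)))

    val n = 10 // salt buckets: the hot key is split n ways

    // salt the skewed side at random; replicate the other side n times
    // so every salted key still finds its match
    val salted1 = rdd1.map { case (k, v) => (s"$k#${Random.nextInt(n)}", v) }
    val salted2 = rdd2.flatMap { case (k, v) =>
      (0 until n).map(i => (s"$k#$i", v))
    }

    val joined = salted1.join(salted2)
      .map { case (sk, pair) => (sk.takeWhile(_ != '#'), pair) } // strip salt

    joined.collect().foreach(println)
    sc.stop()
  }
}

The price is that the replicated side is copied n times, so n has to stay
small relative to the data.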

We have run some experiments on data of much smaller size but the same form.
The method above takes more than 10 minutes, while using the collectAsMap
function to collect RDD2 and sending it to each worker takes about 2 minutes
(roughly the broadcast sketch below). But the second method hits an
OutOfMemoryError when we try it on big data.
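For clarity, this is roughly what the second (fast but memory-bound) method
looks like; the data is again a toy placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-join").setMaster("local[*]"))

    val rdd1 = sc.parallelize(Seq(("1", "a"), ("1765", "c")))
    val rdd2 = sc.parallelize(Seq(("1", 100), ("1765", 200)))

    // All of RDD2 is pulled to the driver, then shipped to every worker.
    // Fast while RDD2 is small, but the map must fit in memory on the
    // driver and on each worker, which is where the OutOfMemoryError
    // shows up on big inputs.
    val small = sc.broadcast(rdd2.collectAsMap())

    val joined = rdd1.flatMap { case (k, v) =>
      small.value.get(k).map(w => (k, (v, w)))
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}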



