Thank you very much for your answer. We tried the method above before; here are the problems we ran into while doing so.
1. We want to avoid the collect approach because we perform this step inside an iteration and RDD2 changes on every iteration, so the speed, memory usage, and network traffic concern us a lot.

2. The keys in RDD1 are not evenly distributed. For example, key "1" appears on every line, while key "1765" may occur fewer than 10 times in all of RDD1. This causes some workers to have far more data to process and take much longer. We ran an experiment on data of the same form but much smaller size: the method above took more than 10 minutes, while collecting RDD2 with collectAsMap and sending it to each worker took 2 minutes. However, the second method throws an OutOfMemoryError when we try it on the big data. A sketch of one possible workaround for the skew follows below.
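For concreteness, here is a minimal sketch of one common way to spread hot keys across workers without collecting RDD2: key salting. This is only an illustration, not our actual code; the helper name skewedJoin, the numSalts parameter, and the String key/value types are all assumptions.

    import org.apache.spark.rdd.RDD
    import scala.util.Random

    // Hypothetical sketch: salt the keys on the skewed side so a hot key
    // like "1" is split across numSalts sub-keys, then replicate the
    // smaller side once per salt so every salted key still finds a match.
    def skewedJoin(
        rdd1: RDD[(String, String)],   // large, skewed side
        rdd2: RDD[(String, String)],   // smaller side, changes each iteration
        numSalts: Int): RDD[(String, (String, String))] = {
      // Append a random salt to each key on the skewed side.
      val salted1 = rdd1.map { case (k, v) =>
        (s"$k#${Random.nextInt(numSalts)}", v)
      }
      // Replicate each rdd2 record once per salt value.
      val salted2 = rdd2.flatMap { case (k, v) =>
        (0 until numSalts).map(i => (s"$k#$i", v))
      }
      // Join on the salted keys, then strip the salt off again.
      salted1.join(salted2).map { case (saltedKey, pair) =>
        (saltedKey.takeWhile(_ != '#'), pair)
      }
    }

The trade-off is that rdd2 is replicated numSalts times over the network instead of once per worker, so this only helps when rdd2 is small relative to rdd1 but too large to broadcast as a whole map.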
