How about an outer join?

On 9 May 2016 13:18, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote:
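The suggestion above can be sketched as follows: express the subtraction as a left outer join followed by a filter that keeps only the unmatched records. This is a minimal sketch on plain Scala collections (no Spark runtime), and it mirrors subtractByKey semantics, i.e. matching on the key only; in Spark, the analogous expression on co-partitioned PairRDDs would be something like `rdd1.leftOuterJoin(rdd2).filter(_._2._2.isEmpty).mapValues(_._1)`, which avoids a shuffle when both RDDs share the same partitioner.

```scala
// Sketch: A − B expressed as a left outer join plus a filter, on plain
// Scala collections. The names and shapes here are illustrative, not the
// actual Spark implementation.
object SubtractViaOuterJoin {
  def subtract[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Seq[(K, V)] = {
    val rightKeys = right.map(_._1).toSet
    // Left outer join: pair each left record with an Option of a right
    // match, then keep only the records whose right side is None.
    left
      .map { case (k, v) => (k, (v, if (rightKeys(k)) Some(()) else None)) }
      .collect { case (k, (v, None)) => (k, v) }
  }
}
```

Note that this removes records by key; the RDD `subtract` in the original question compares whole elements, so a join-based variant would have to compare values as well as keys.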
> Hello All,
>
> We have two PairRDDs (rdd1, rdd2) that are hash partitioned on the key
> (the number of partitions is the same for both RDDs). We would like to
> subtract rdd2 from rdd1.
>
> The subtract code at
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
> seems to wrap the elements of both RDDs as (x, null), where x is the
> element of the RDD, partition them, and then use subtractByKey(). This
> way, the RDDs have to be repartitioned on x (which, in our case, is the
> key and value combined). In our case, both RDDs are already hash
> partitioned on the key of x. Can we take advantage of this by having a
> PairRDD/HashPartitioner-aware subtract? Is there a way to use
> mapPartitions() for this?
>
> We tried broadcasting rdd2 and using mapPartitions: a local set
> difference between each partition of rdd1 and the broadcast rdd2 (inside
> mapPartitions of rdd1). This turned out to be memory-consuming and
> inefficient. We did call destroy() on the broadcast value, but it does
> not help.
>
> The current subtract method is slow for us. rdd1 and rdd2 are around
> 700MB each, and the subtract takes around 14 seconds.
>
> Any ideas on this issue are highly appreciated.
>
> Regards,
> Raghava.
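The partitioner-aware subtract asked about above can be sketched with a per-partition set difference. This is a hypothetical sketch on plain Scala collections, assuming both RDDs use the same HashPartitioner with the same partition count, so that matching elements always land in partitions with the same index; partitions are modeled here as plain Seqs. In Spark the same per-partition logic could run without a shuffle via `rdd1.zipPartitions(rdd2) { (it1, it2) => ... }`, which requires equal partition counts.

```scala
// Sketch of a partitioner-aware subtract: because both sides are hash
// partitioned on the key, an element of rdd2 can only cancel an element
// of rdd1 that lives in the partition with the same index, so the
// difference can be computed partition-locally.
object PartitionwiseSubtract {
  // Full-element set difference within one pair of aligned partitions,
  // i.e. subtract semantics (compare whole elements), not subtractByKey.
  def subtractPartition[T](p1: Iterator[T], p2: Iterator[T]): Iterator[T] = {
    val toRemove = p2.toSet          // materialize only this partition of rdd2
    p1.filterNot(toRemove.contains)
  }

  def main(args: Array[String]): Unit = {
    // Two "RDDs", each split into the same two hash partitions on the key.
    val rdd1 = List(List((0, "a"), (2, "b")), List((1, "c"), (3, "d")))
    val rdd2 = List(List((2, "b")), List((3, "d")))
    val diff = rdd1.zip(rdd2).map { case (p1, p2) =>
      subtractPartition(p1.iterator, p2.iterator).toList
    }
    println(diff) // → List(List((0,a)), List((1,c)))
  }
}
```

Unlike the broadcast approach described in the email, each task here only holds one partition of rdd2 in memory rather than the whole RDD, which should keep the memory footprint proportional to partition size.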