How about outer join?
On 9 May 2016 13:18, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote:

> Hello All,
>
> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key
> (number of partitions are same for both the RDDs). We would like to
> subtract rdd2 from rdd1.
>
> The subtract code at
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
> seems to group the elements of both the RDDs using (x, null) where x is the
> element of the RDD and partition them. Then it makes use of
> subtractByKey(). This way, RDDs have to be repartitioned on x (which in our
> case, is both key and value combined). In our case, both the RDDs are
> already hash partitioned on the key of x. Can we take advantage of this by
> having a PairRDD/HashPartitioner-aware subtract? Is there a way to use
> mapPartitions() for this?
>
> We tried to broadcast rdd2 and use mapPartitions. But this turns out to be
> memory consuming and inefficient. We tried to do a local set difference
> between rdd1 and the broadcasted rdd2 (in mapPartitions of rdd1). We did
> use destroy() on the broadcasted value, but it does not help.
>
> The current subtract method is slow for us. rdd1 and rdd2 are around 700MB
> each and the subtract takes around 14 seconds.
>
> Any ideas on this issue is highly appreciated.
>
> Regards,
> Raghava.
>

Reply via email to