If each partition fits in memory, you can do this with mapPartitions: build the inverse mapping inside each partition by constructing a hash map yourself, then return its entries as that partition's output. mapPartitions works on each partition independently, so no shuffle is involved.
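Here is a minimal PySpark sketch of that idea. It assumes the RDD's elements are (key, list-of-values) pairs, as in the example quoted below. The names invert_partition and invert_within_partitions are made up for illustration and are not part of Spark.

```python
from collections import defaultdict


def invert_partition(records):
    """Invert the (key, values) pairs of ONE partition into (value, keys) pairs.

    `records` is the iterator that mapPartitions hands to the function.
    The whole inverse map is held in memory, so the partition must fit.
    """
    inverse = defaultdict(list)
    for key, values in records:
        for value in values:
            inverse[value].append(key)
    # mapPartitions expects an iterable of output records back.
    return iter(inverse.items())


def invert_within_partitions(rdd):
    # Hypothetical helper: applies the per-partition inversion.
    # preservesPartitioning is left at its default (False) because the
    # output keys are the old values, so any existing partitioner no
    # longer describes where a given key lives.
    return rdd.mapPartitions(invert_partition)
```

The same value can still show up as a key in more than one partition of the result, which matches what the original question said was acceptable.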

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya <aara...@gmail.com> wrote:
> I have a use case where my RDD is set up such:
>
> Partition 0:
> K1 -> [V1, V2]
> K2 -> [V2]
>
> Partition 1:
> K3 -> [V1]
> K4 -> [V3]
>
> I want to invert this RDD, but only within a partition, so that the
> operation does not require a shuffle. It doesn't matter if the partitions
> of the inverted RDD have non unique keys across the partitions, for example:
>
> Partition 0:
> V1 -> [K1]
> V2 -> [K1, K2]
>
> Partition 1:
> V1 -> [K3]
> V3 -> [K4]
>
> Is there a way to do only a per-partition groupBy, instead of shuffling the
> entire data?