If you'd like to re-use the resulting inverted map, you can persist the result:
x = myRdd.mapPartitions(create inverted map).persist()

Your function would create the reverse map and then return an iterator
over the entries of that map.

On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya <aara...@gmail.com> wrote:
> Patrick,
>
> If I understand this correctly, I won't be able to do this in the closure
> provided to mapPartitions() because that's going to be stateless, in the
> sense that a hash map that I create within the closure would only be useful
> for one call of MapPartitionsRDD.compute(). I guess I would need to
> override mapPartitions() directly within my RDD. Right?
>
> On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>> If each partition can fit in memory, you can do this using
>> mapPartitions and then building an inverse mapping within each
>> partition. You'd need to construct a hash map within each partition
>> yourself.
>>
>> On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya <aara...@gmail.com> wrote:
>> > I have a use case where my RDD is set up as follows:
>> >
>> > Partition 0:
>> > K1 -> [V1, V2]
>> > K2 -> [V2]
>> >
>> > Partition 1:
>> > K3 -> [V1]
>> > K4 -> [V3]
>> >
>> > I want to invert this RDD, but only within a partition, so that the
>> > operation does not require a shuffle. It doesn't matter if the
>> > partitions of the inverted RDD have non-unique keys across partitions,
>> > for example:
>> >
>> > Partition 0:
>> > V1 -> [K1]
>> > V2 -> [K1, K2]
>> >
>> > Partition 1:
>> > V1 -> [K3]
>> > V3 -> [K4]
>> >
>> > Is there a way to do only a per-partition groupBy, instead of shuffling
>> > the entire data?
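
Putting Patrick's suggestion and the persist() step above together, here is a
minimal sketch of the per-partition inversion. It assumes each partition's
inverted map fits in memory, as stated in the thread; the object name, the toy
data, and the local[2] master are illustrative and not from the thread.

    import scala.collection.mutable
    import org.apache.spark.{SparkConf, SparkContext}

    object InvertWithinPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("invert-within-partitions").setMaster("local[2]"))

        // Toy RDD[(K, Seq[V])] laid out like the example in the thread:
        // the first two pairs land in partition 0, the last two in partition 1.
        val rdd = sc.parallelize(Seq(
          "K1" -> Seq("V1", "V2"), "K2" -> Seq("V2"),
          "K3" -> Seq("V1"), "K4" -> Seq("V3")
        ), numSlices = 2)

        // Invert each partition locally: build a hash map per partition and
        // emit its entries. No shuffle happens, so a value may appear as a key
        // in more than one partition of the result.
        val inverted = rdd.mapPartitions { iter =>
          val m = mutable.Map.empty[String, mutable.Buffer[String]]
          for ((k, vs) <- iter; v <- vs)
            m.getOrElseUpdate(v, mutable.Buffer.empty[String]) += k
          m.iterator.map { case (v, ks) => (v, ks.toList) }
        }.persist()  // keep the inverted map around for re-use, as suggested above

        // Print each partition's contents to show the per-partition inversion.
        inverted.glom().collect().foreach(p => println(p.mkString(", ")))
        sc.stop()
      }
    }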