Re: partitioned groupBy

Patrick Wendell Tue, 16 Sep 2014 16:58:11 -0700

If each partition fits in memory, you can do this with mapPartitions
by building the inverse mapping within each partition. You'd need to
construct the hash map within each partition yourself. mapPartitions
is a narrow transformation, so no shuffle is involved.
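
A minimal sketch of that per-partition function, assuming PySpark and
assuming each record is a (key, list_of_values) pair. The name
invert_partition and the sample data are made up for illustration, and
the check at the bottom runs locally without Spark:

```python
from collections import defaultdict


def invert_partition(records):
    """Invert the (key, values) pairs seen in a single partition.

    `records` is an iterator over (key, list_of_values) pairs, which is
    the shape assumed here for the RDD in the question.
    """
    inverse = defaultdict(list)  # value -> keys, held entirely in memory
    for key, values in records:
        for value in values:
            inverse[value].append(key)
    # Hand back an iterator over (value, list_of_keys) pairs
    return iter(inverse.items())


# Local check using the two partitions from the question (no Spark needed)
partition0 = [("K1", ["V1", "V2"]), ("K2", ["V2"])]
partition1 = [("K3", ["V1"]), ("K4", ["V3"])]
inverted0 = dict(invert_partition(iter(partition0)))
inverted1 = dict(invert_partition(iter(partition1)))
```

With an actual RDD (called rdd here as a placeholder), this would be
applied as rdd.mapPartitions(invert_partition). Since the keys change,
it is probably safest to leave preservesPartitioning at its default of
False.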


On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya <aara...@gmail.com> wrote:
> I have a use case where my RDD is set up as follows:
>
> Partition 0:
> K1 -> [V1, V2]
> K2 -> [V2]
>
> Partition 1:
> K3 -> [V1]
> K4 -> [V3]
>
> I want to invert this RDD, but only within a partition, so that the
> operation does not require a shuffle.  It doesn't matter if the partitions
> of the inverted RDD have non-unique keys across the partitions, for example:
>
> Partition 0:
> V1 -> [K1]
> V2 -> [K1, K2]
>
> Partition 1:
> V1 -> [K3]
> V3 -> [K4]
>
> Is there a way to do only a per-partition groupBy, instead of shuffling the
> entire data?
>
