Re: partitioned groupBy
Patrick,

If I understand this correctly, I won't be able to do this in the closure provided to mapPartitions(), because that closure is going to be stateless, in the sense that a hash map I create within it would only be useful for one call of MapPartitionsRDD.compute(). I guess I would need to override mapPartitions() directly within my RDD. Right?

On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell pwend...@gmail.com wrote:

> If each partition can fit in memory, you can do this using mapPartitions
> and then building an inverse mapping within each partition. You'd need to
> construct a hash map within each partition yourself.
> [...]
Re: partitioned groupBy
If you'd like to re-use the resulting inverted map, you can persist the result:

    x = myRdd.mapPartitions(create inverted map).persist()

Your function would create the reverse map and then return an iterator over the entries in that map.

On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya aara...@gmail.com wrote:

> Patrick,
>
> If I understand this correctly, I won't be able to do this in the closure
> provided to mapPartitions(), because that closure is going to be stateless,
> in the sense that a hash map I create within it would only be useful for
> one call of MapPartitionsRDD.compute(). I guess I would need to override
> mapPartitions() directly within my RDD. Right?
> [...]
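A minimal sketch of this in Scala (names are illustrative: inverted is assumed to be the result of a mapPartitions-based inversion, like the one sketched after Patrick's first reply further down the thread):

    // `inverted` is assumed: an RDD produced by a per-partition inversion
    // via mapPartitions (full sketch below in the thread).
    val cached = inverted.persist()  // default MEMORY_ONLY storage level

    cached.count()    // first action: computes and caches the inverted partitions
    cached.collect()  // later actions reuse the cached partitions instead of
                      // re-running the closure, i.e. MapPartitionsRDD.compute()

Without persist(), every action on the result would rebuild the per-partition hash maps, which is exactly the re-computation the previous message was worried about.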
partitioned groupBy
I have a use case where my RDD is set up like this:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within each partition, so that the operation does not require a shuffle. It doesn't matter if the inverted RDD has non-unique keys across partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the entire dataset?
Re: partitioned groupBy
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself.

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:

> I have a use case where my RDD is set up like this:
> [...]
> Is there a way to do only a per-partition groupBy, instead of shuffling
> the entire dataset?
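For reference, a minimal, self-contained sketch of this approach (String keys/values and a local SparkContext are assumptions for illustration; in spark-shell, sc already exists):

    import scala.collection.mutable
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("invert-sketch").setMaster("local[2]"))

    // Two partitions matching the example in the question.
    val rdd = sc.parallelize(Seq(
      ("K1", Seq("V1", "V2")), ("K2", Seq("V2")),  // partition 0
      ("K3", Seq("V1")),       ("K4", Seq("V3"))   // partition 1
    ), numSlices = 2)

    // Invert within each partition. mapPartitions is a narrow
    // transformation: each output partition is computed from exactly one
    // input partition, so no shuffle occurs. Keys may therefore repeat
    // across partitions (V1 ends up in both).
    val inverted = rdd.mapPartitions { iter =>
      val inverse = mutable.Map.empty[String, mutable.ListBuffer[String]]
      for ((k, vs) <- iter; v <- vs)
        inverse.getOrElseUpdate(v, mutable.ListBuffer.empty) += k
      inverse.iterator.map { case (v, ks) => (v, ks.toList) }
    }

    // Print each partition's contents to check the result.
    inverted.glom().collect().foreach(p => println(p.mkString(", ")))

The hash map lives only for the duration of one partition's computation, which is why persisting the result (as suggested earlier in the thread) matters if you need it more than once.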