Re: partitioned groupBy

2014-09-17 Thread Akshat Aranya
Patrick,

If I understand this correctly, I won't be able to do this in the closure
provided to mapPartitions() because that's going to be stateless, in the
sense that a hash map that I create within the closure would only be useful
for one call of MapPartitionsRDD.compute().  I guess I would need to
override mapPartitions() directly within my RDD.  Right?

On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 If each partition can fit in memory, you can do this using
 mapPartitions and then building an inverse mapping within each
 partition. You'd need to construct a hash map within each partition
 yourself.

 On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
  I have a use case where my RDD is set up as follows:
 
  Partition 0:
  K1 -> [V1, V2]
  K2 -> [V2]
 
  Partition 1:
  K3 -> [V1]
  K4 -> [V3]
 
  I want to invert this RDD, but only within a partition, so that the
  operation does not require a shuffle.  It doesn't matter if the partitions
  of the inverted RDD have non-unique keys across the partitions, for example:
 
  Partition 0:
  V1 -> [K1]
  V2 -> [K1, K2]
 
  Partition 1:
  V1 -> [K3]
  V3 -> [K4]
 
  Is there a way to do only a per-partition groupBy, instead of shuffling the
  entire data?
 
 



Re: partitioned groupBy

2014-09-17 Thread Patrick Wendell
If you'd like to re-use the resulting inverted map, you can persist the result:

x = myRdd.mapPartitions(create inverted map).persist()

Your function would create the reverse map and then return an iterator
over the keys in that map.
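
Concretely, the function passed to mapPartitions() could look like the
following sketch (plain Python; `my_rdd` and the record shape
`(key, [values])` are assumptions based on the example data in this thread):

```python
def invert_partition(records):
    """Invert one partition's (key, [values]) records into value -> [keys]."""
    inverted = {}
    for key, values in records:
        for value in values:
            inverted.setdefault(value, []).append(key)
    # mapPartitions expects an iterator back, not a dict.
    return iter(inverted.items())

# Hypothetical usage on an RDD of (key, [values]) pairs:
# x = my_rdd.mapPartitions(invert_partition).persist()
```

Because the hash map is rebuilt on each compute() of the partition,
persisting the result is what saves you from paying that cost again.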


On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya aara...@gmail.com wrote:
 Patrick,

 If I understand this correctly, I won't be able to do this in the closure
 provided to mapPartitions() because that's going to be stateless, in the
 sense that a hash map that I create within the closure would only be useful
 for one call of MapPartitionsRDD.compute().  I guess I would need to
 override mapPartitions() directly within my RDD.  Right?

 On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 If each partition can fit in memory, you can do this using
 mapPartitions and then building an inverse mapping within each
 partition. You'd need to construct a hash map within each partition
 yourself.

 On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
  I have a use case where my RDD is set up as follows:
 
  Partition 0:
  K1 -> [V1, V2]
  K2 -> [V2]
 
  Partition 1:
  K3 -> [V1]
  K4 -> [V3]
 
  I want to invert this RDD, but only within a partition, so that the
  operation does not require a shuffle.  It doesn't matter if the partitions
  of the inverted RDD have non-unique keys across the partitions, for example:
 
  Partition 0:
  V1 -> [K1]
  V2 -> [K1, K2]
 
  Partition 1:
  V1 -> [K3]
  V3 -> [K4]
 
  Is there a way to do only a per-partition groupBy, instead of shuffling the
  entire data?
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up as follows:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle.  It doesn't matter if the partitions
of the inverted RDD have non-unique keys across the partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire data?


Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
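
As a rough sketch of what building that hash map per partition amounts to,
here is the two-partition example from the question simulated locally in
plain Python (names illustrative; each partition's map is built
independently, with no shuffle):

```python
from collections import defaultdict

def invert_partition(records):
    # Build a value -> [keys] hash map from this partition's records only.
    inverted = defaultdict(list)
    for key, values in records:
        for value in values:
            inverted[value].append(key)
    return dict(inverted)

# Two partitions, processed independently.
partitions = [
    [("K1", ["V1", "V2"]), ("K2", ["V2"])],  # Partition 0
    [("K3", ["V1"]), ("K4", ["V3"])],        # Partition 1
]
inverted_partitions = [invert_partition(p) for p in partitions]
# V1 ends up as a key in both partitions, which is acceptable here.
```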

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
 I have a use case where my RDD is set up as follows:

 Partition 0:
 K1 -> [V1, V2]
 K2 -> [V2]

 Partition 1:
 K3 -> [V1]
 K4 -> [V3]

 I want to invert this RDD, but only within a partition, so that the
 operation does not require a shuffle.  It doesn't matter if the partitions
 of the inverted RDD have non-unique keys across the partitions, for example:

 Partition 0:
 V1 -> [K1]
 V2 -> [K1, K2]

 Partition 1:
 V1 -> [K3]
 V3 -> [K4]

 Is there a way to do only a per-partition groupBy, instead of shuffling the
 entire data?

