If you'd like to re-use the resulting inverted map, you can persist the result:

val x = myRdd.mapPartitions(invertPartition).persist()

Your function would build the reverse map for its partition and then
return an iterator over the entries of that map.
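
For instance, here's a minimal sketch in Scala (invertPartition is just
an illustrative name, and it assumes an RDD[(K, Seq[V])] whose
per-partition inverted map fits in memory):

import scala.collection.mutable

// Build the inverted map for one partition: (K, Seq[V]) pairs become
// (V, Seq[K]) pairs, entirely within the partition, so no shuffle occurs.
def invertPartition[K, V](iter: Iterator[(K, Seq[V])]): Iterator[(V, Seq[K])] = {
  val inverted = mutable.Map.empty[V, List[K]]
  for ((k, vs) <- iter; v <- vs)
    inverted(v) = k :: inverted.getOrElse(v, Nil)
  inverted.iterator  // iterator over the inverted (value -> keys) entries
}

Applied to the sample data below, partition 0 would then yield
V1 -> [K1] and V2 -> [K1, K2] without any data crossing partitions, and
persist() keeps the inverted pairs cached across actions.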


On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya <aara...@gmail.com> wrote:
> Patrick,
>
> If I understand this correctly, I won't be able to do this in the closure
> provided to mapPartitions() because that's going to be stateless, in the
> sense that a hash map that I create within the closure would only be useful
> for one call of MapPartitionsRDD.compute().  I guess I would need to
> override mapPartitions() directly within my RDD.  Right?
>
> On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>> If each partition can fit in memory, you can do this using
>> mapPartitions and then building an inverse mapping within each
>> partition. You'd need to construct a hash map within each partition
>> yourself.
>>
>> On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya <aara...@gmail.com> wrote:
> >> > I have a use case where my RDD is set up like this:
>> >
>> > Partition 0:
>> > K1 -> [V1, V2]
>> > K2 -> [V2]
>> >
>> > Partition 1:
>> > K3 -> [V1]
>> > K4 -> [V3]
>> >
>> > I want to invert this RDD, but only within a partition, so that the
> >> > operation does not require a shuffle.  It doesn't matter if the
> >> > partitions of the inverted RDD have non-unique keys across the
> >> > partitions, for example:
>> >
>> > Partition 0:
>> > V1 -> [K1]
>> > V2 -> [K1, K2]
>> >
>> > Partition 1:
>> > V1 -> [K3]
>> > V3 -> [K4]
>> >
> >> > Is there a way to do only a per-partition groupBy, instead of
> >> > shuffling the entire dataset?
>> >
>
>
