Hi nilmish,
One option is to switch to a different algorithm. The
SpaceSaver/StreamSummary method gives you approximate results in exchange
for a smaller data structure. There is an implementation in Twitter's
Algebird library, if you're using Scala:
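To give a feel for the trade-off, here is a minimal plain-Scala sketch of the Space-Saving idea behind that method. It is not Algebird's API: the class name SpaceSavingSketch, the capacity parameter, and the topK helper are all made up for illustration, and it finds the minimum with a linear scan rather than the linked StreamSummary structure a real implementation would use.

```scala
import scala.collection.mutable

// Hypothetical helper, NOT Algebird's API: tracks at most `capacity` items
// and returns approximate counts for the most frequent ones.
final class SpaceSavingSketch[T](capacity: Int) {
  require(capacity > 0, "capacity must be positive")

  // item -> (estimated count, maximum possible overestimate)
  private val counters = mutable.Map.empty[T, (Long, Long)]

  def add(item: T): Unit = counters.get(item) match {
    case Some((count, err)) =>
      // Already tracked: bump its count.
      counters(item) = (count + 1, err)
    case None if counters.size < capacity =>
      // Room left: start tracking with an exact count.
      counters(item) = (1L, 0L)
    case None =>
      // Full: evict the item with the smallest count. The newcomer inherits
      // that count, which is also the most it can be overestimated by.
      val (victim, (minCount, _)) = counters.minBy { case (_, (c, _)) => c }
      counters.remove(victim)
      counters(item) = (minCount + 1, minCount)
  }

  // Top k tracked items by estimated count, highest first.
  def topK(k: Int): Seq[(T, Long)] =
    counters.toSeq.map { case (item, (c, _)) => (item, c) }.sortBy(-_._2).take(k)
}
```

With a capacity of 2, feeding it the stream a, b, a, c, a leaves "a" on top with an estimated count of 3. Memory stays bounded by the capacity no matter how many distinct items pass through, which is the point of the method; the price is that counts for evicted-and-replaced items can be overestimated.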
Hello Mohit,
I don't think there's a direct way of bleeding elements across partitions.
But you could write it yourself relatively succinctly:
A) Sort the RDD
B) Look at the sorted RDD's partitions with the .mapPartitionsWithIndex()
method. Map each partition to its partition ID, and its