Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a sorted pair rdd anyway made sense. And apologies for not reading the source for top before asking the question (/*poor attempt to save time*/).
On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen <[email protected]> wrote: > No, it's definitely not done on the driver. It works as you say. Look > at the source code for RDD.takeOrdered, which is what top calls. > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 > > On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar <[email protected]> > wrote: > > I'm looking to select the top n records (by rank) from a data set of a > few > > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is > > entirely a driver-side operation in that all records are sorted in the > > driver's memory. I prefer an approach where the records are sorted on the > > cluster and only the top ones sent to the driver. I'm hence leaning > towards > > creating a JavaPairRDD on a key, then sorting the rdd by key and > invoking > > take(N). I'd like to know if rdd.top achieves the same result (while > being > > executed on the cluster) as take or if my assumption that it's a driver > side > > operation is correct. > > > > Thanks, > > Bharath >
