No, it's definitely not done on the driver. It works as you say. Look at the source code for RDD.takeOrdered, which is what top calls.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar <[email protected]> wrote: > I'm looking to select the top n records (by rank) from a data set of a few > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is > entirely a driver-side operation in that all records are sorted in the > driver's memory. I prefer an approach where the records are sorted on the > cluster and only the top ones sent to the driver. I'm hence leaning towards > creating a JavaPairRDD on a key, then sorting the rdd by key and invoking > take(N). I'd like to know if rdd.top achieves the same result (while being > executed on the cluster) as take or if my assumption that it's a driver side > operation is correct. > > Thanks, > Bharath
