Re: Implementing percentile through top Vs take
Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a sorted pair rdd anyway made sense. And apologies for not reading the source for top before asking the question (/*poor attempt to save time*/). On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen wrote: > No, it's definitely not done on the driver. It works as you say. Look > at the source code for RDD.takeOrdered, which is what top calls. > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 > > On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar > wrote: > > I'm looking to select the top n records (by rank) from a data set of a > few > > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is > > entirely a driver-side operation in that all records are sorted in the > > driver's memory. I prefer an approach where the records are sorted on the > > cluster and only the top ones sent to the driver. I'm hence leaning > towards > > creating a JavaPairRDD on a key, then sorting the rdd by key and > invoking > > take(N). I'd like to know if rdd.top achieves the same result (while > being > > executed on the cluster) as take or if my assumption that it's a driver > side > > operation is correct. > > > > Thanks, > > Bharath >
Re: Implementing percentile through top Vs take
No, it's definitely not done on the driver. It works as you say. Look at the source code for RDD.takeOrdered, which is what top calls. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar wrote: > I'm looking to select the top n records (by rank) from a data set of a few > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is > entirely a driver-side operation in that all records are sorted in the > driver's memory. I prefer an approach where the records are sorted on the > cluster and only the top ones sent to the driver. I'm hence leaning towards > creating a JavaPairRDD on a key, then sorting the rdd by key and invoking > take(N). I'd like to know if rdd.top achieves the same result (while being > executed on the cluster) as take or if my assumption that it's a driver side > operation is correct. > > Thanks, > Bharath
Implementing percentile through top Vs take
I'm looking to select the top n records (by rank) from a data set of a few hundred GB's. My understanding is that JavaRDD.top(n, comparator) is entirely a driver-side operation in that all records are sorted in the driver's memory. I prefer an approach where the records are sorted on the cluster and only the top ones sent to the driver. I'm hence leaning towards creating a JavaPairRDD on a key, then sorting the rdd by key and invoking take(N). I'd like to know if rdd.top achieves the same result (while being executed on the cluster) as take or if my assumption that it's a driver side operation is correct. Thanks, Bharath