Re: Implementing percentile through top Vs take

Sean Owen Wed, 30 Jul 2014 12:05:55 -0700

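No, it's definitely not done on the driver. It works the way you want: each
partition keeps only its own top n on the cluster, and only those candidates
are merged into the final result. Look at the source code for
RDD.takeOrdered, which is what top calls.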


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
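
To make the idea concrete, here is a rough sketch in plain Java (no Spark
dependency) of the general pattern: a bounded heap per partition, then a merge
of the small per-partition results. It is not Spark's actual code. The class
and method names (TopNSketch, topOfPartition, top) are made up for
illustration, and the nested lists simply stand in for an RDD's partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: simulates "top n per partition, then merge".
final class TopNSketch {

    // Keep only the n largest elements of one partition, using a min-heap
    // that is never allowed to hold more than n entries.
    static <T> List<T> topOfPartition(List<T> partition, int n, Comparator<T> cmp) {
        PriorityQueue<T> heap = new PriorityQueue<>(Math.max(1, n), cmp);
        for (T item : partition) {
            heap.add(item);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest of the current candidates
            }
        }
        return new ArrayList<>(heap);
    }

    // Merge the per-partition candidates (at most n per partition) and keep
    // the overall n largest, in descending order.
    static <T> List<T> top(List<List<T>> partitions, int n, Comparator<T> cmp) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        List<T> candidates = new ArrayList<>();
        for (List<T> partition : partitions) {
            candidates.addAll(topOfPartition(partition, n, cmp));
        }
        candidates.sort(cmp.reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    public static void main(String[] args) {
        // Three lists standing in for three partitions of the data set.
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(5, 1, 9, 3),
            Arrays.asList(8, 2, 7),
            Arrays.asList(4, 10, 6));
        System.out.println(top(partitions, 3, Comparator.<Integer>naturalOrder()));
        // prints [10, 9, 8]
    }
}
```

The point of the sketch is that the merge step only ever sees at most n
elements per partition, so the full data set never has to fit in one place.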

On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar <[email protected]> wrote:
> I'm looking to select the top n records (by rank) from a data set of a few
> hundred GB. My understanding is that JavaRDD.top(n, comparator) is
> entirely a driver-side operation, in that all records are sorted in the
> driver's memory. I would prefer an approach where the records are sorted on
> the cluster and only the top ones are sent to the driver. I'm therefore
> leaning towards creating a JavaPairRDD on a key, sorting the RDD by key, and
> invoking take(n). I'd like to know whether rdd.top achieves the same result
> as take (while being executed on the cluster), or whether my assumption that
> it's a driver-side operation is correct.
>
> Thanks,
> Bharath
