Re: Implementing percentile through top Vs take

Bharath Ravi Kumar Thu, 31 Jul 2014 03:19:12 -0700

Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a
sorted pair rdd anyway made sense. And apologies for not reading the source
for top before asking the question (/*poor attempt to save time*/).



On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen <[email protected]> wrote:

> No, it's definitely not done on the driver. It works as you say. Look
> at the source code for RDD.takeOrdered, which is what top calls.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
>
> On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar <[email protected]>
> wrote:
> > I'm looking to select the top n records (by rank) from a data set of a
> few
> > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
> > entirely a driver-side operation in that all records are sorted in the
> > driver's memory. I prefer an approach where the records are sorted on the
> > cluster and only the top ones sent to the driver. I'm hence leaning
> towards
> > creating a  JavaPairRDD on a key, then sorting the rdd by key and
> invoking
> > take(N). I'd like to know if rdd.top achieves the same result (while
> being
> > executed on the cluster) as take or if my assumption that it's a driver
> side
> > operation is correct.
> >
> > Thanks,
> > Bharath
>

Re: Implementing percentile through top Vs take

Reply via email to