Re: Implementing percentile through top Vs take

2014-07-31 Thread Bharath Ravi Kumar
Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a
sorted pair rdd anyway made sense. And apologies for not reading the source
for top before asking the question (/*poor attempt to save time*/).


On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen  wrote:

> No, it's definitely not done on the driver. It works as you say. Look
> at the source code for RDD.takeOrdered, which is what top calls.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
>
> On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar 
> wrote:
> > I'm looking to select the top n records (by rank) from a data set of a
> few
> > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
> > entirely a driver-side operation in that all records are sorted in the
> > driver's memory. I prefer an approach where the records are sorted on the
> > cluster and only the top ones sent to the driver. I'm hence leaning
> towards
> > creating a  JavaPairRDD on a key, then sorting the rdd by key and
> invoking
> > take(N). I'd like to know if rdd.top achieves the same result (while
> being
> > executed on the cluster) as take or if my assumption that it's a driver
> side
> > operation is correct.
> >
> > Thanks,
> > Bharath
>


Re: Implementing percentile through top Vs take

2014-07-30 Thread Sean Owen
No, it's definitely not done on the driver. It works as you say. Look
at the source code for RDD.takeOrdered, which is what top calls.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130

On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar  wrote:
> I'm looking to select the top n records (by rank) from a data set of a few
> hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
> entirely a driver-side operation in that all records are sorted in the
> driver's memory. I prefer an approach where the records are sorted on the
> cluster and only the top ones sent to the driver. I'm hence leaning towards
> creating a  JavaPairRDD on a key, then sorting the rdd by key and invoking
> take(N). I'd like to know if rdd.top achieves the same result (while being
> executed on the cluster) as take or if my assumption that it's a driver side
> operation is correct.
>
> Thanks,
> Bharath


Implementing percentile through top Vs take

2014-07-30 Thread Bharath Ravi Kumar
I'm looking to select the top n records (by rank) from a data set of a few
hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
entirely a driver-side operation in that all records are sorted in the
driver's memory. I prefer an approach where the records are sorted on the
cluster and only the top ones sent to the driver. I'm hence leaning towards
creating a  JavaPairRDD on a key, then sorting the rdd by key and invoking
take(N). I'd like to know if rdd.top achieves the same result (while being
executed on the cluster) as take or if my assumption that it's a driver
side operation is correct.

Thanks,
Bharath