so i was thinking along these lines, assuming i start with p partitions: 1) create a priority queue of size k per partition 2) repartition to create one partition 3) reduce
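to make that concrete, here is a rough sketch of what i mean (untested; topKOf, topK and the Int element type are just my stand-ins, and i ship the heap contents as plain arrays so serializing the queue class itself is a non-issue):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// hypothetical helper: a min-heap capped at k elements, so after
// consuming all items it holds the k largest values seen
def topKOf(k: Int, items: Iterator[Int]): Array[Int] = {
  val queue = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  items.foreach { x =>
    queue.enqueue(x)
    if (queue.size > k) queue.dequeue() // evict the current smallest
  }
  queue.toArray
}

def topK(rdd: RDD[Int], k: Int): Array[Int] = {
  rdd
    .mapPartitions(items => Iterator.single(topKOf(k, items))) // 1) one size-k result per partition
    .repartition(1)                                            // 2) shuffle the p results into a single partition
    .mapPartitions(arrays =>
      Iterator.single(topKOf(k, arrays.flatMap(_.iterator))))  // 3) reduce inside the cluster
    .first()                                                   // only the final k values reach the driver
}

since each partition contributes only k values, step 2 moves at most p * k values through the shuffle.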
i guess the worry is that in step 2 the one partition needs to hold p priority queues of size k in memory... the benefit is that the p priority queues do not get sent to the driver (which is not on the cluster).

On Sat, Jul 5, 2014 at 1:20 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i guess i could create a single priority queue per partition, then shuffle to
> a new rdd with 1 partition, and then reduce?
>
> On Sat, Jul 5, 2014 at 1:16 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> my initial approach to taking the top k values of an rdd was using a
>> priority-queue monoid, along these lines:
>>
>> rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) },
>> false).reduce(monoid.plus)
>>
>> this works fine, but looking at the code for reduce, it first reduces within
>> a partition (which doesn't help me) and then sends the results to the driver,
>> where they get reduced again. this means that for every partition the
>> (potentially very bulky) priority queue gets shipped to the driver.
>>
>> my driver is client side, not inside the cluster, and i cannot change this,
>> so shipping all these queues to the driver can be expensive.
>>
>> is there a better way to do this? should i try a shuffle first to reduce the
>> number of partitions to the minimum (since the number of queues shipped is
>> equal to the number of partitions)?
>>
>> is there a way to reduce to a single-item RDD, so the queues stay inside the
>> cluster and i can retrieve the final result with RDD.first?
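for completeness, the monoid approach from the quoted message, fleshed out the same way (again a sketch: TopKMonoid is a hand-rolled stand-in for whatever priority-queue monoid is actually in use, e.g. algebird's, and it assumes the scala queue serializes cleanly; spark itself keeps an internal serializable BoundedPriorityQueue around for top/takeOrdered for the same reason):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// hand-rolled stand-in for a priority-queue monoid: build makes a min-heap
// capped at k elements, plus merges two of them, keeping the k largest
class TopKMonoid(k: Int) extends Serializable {
  def build(items: Iterator[Int]): mutable.PriorityQueue[Int] = {
    val q = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
    items.foreach { x => q.enqueue(x); if (q.size > k) q.dequeue() }
    q
  }
  def plus(a: mutable.PriorityQueue[Int], b: mutable.PriorityQueue[Int]): mutable.PriorityQueue[Int] = {
    b.foreach { x => a.enqueue(x); if (a.size > k) a.dequeue() }
    a
  }
}

def topKViaReduce(rdd: RDD[Int], k: Int): mutable.PriorityQueue[Int] = {
  val monoid = new TopKMonoid(k)
  rdd
    .mapPartitions(items => Iterator.single(monoid.build(items)))
    .reduce(monoid.plus) // merges within each partition, then ships every
                         // partition's queue to the driver and merges there
}

this is the shape where every partition's (potentially bulky) queue travels to the client-side driver; the repartition(1) sketch above keeps that merge on the cluster and only ships the final k values.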