so i was thinking along these lines, assuming i start with p partitions: 1) create a priority queue of size k per partition 2) repartition to create one partition 3) reduce
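to make that concrete, here is a rough sketch of what i mean (untested; topKOf, topK and the Int element type are just my stand-ins, and i ship the heap contents as plain arrays so serializing the queue class itself is a non-issue):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// hypothetical helper: a min-heap capped at k elements, so after
// consuming all items it holds the k largest values seen
def topKOf(k: Int, items: Iterator[Int]): Array[Int] = {
  val queue = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  items.foreach { x =>
    queue.enqueue(x)
    if (queue.size > k) queue.dequeue() // evict the current smallest
  }
  queue.toArray
}

def topK(rdd: RDD[Int], k: Int): Array[Int] = {
  rdd
    .mapPartitions(items => Iterator.single(topKOf(k, items))) // 1) one size-k result per partition
    .repartition(1)                                            // 2) shuffle the p results into a single partition
    .mapPartitions(arrays =>
      Iterator.single(topKOf(k, arrays.flatMap(_.iterator))))  // 3) reduce inside the cluster
    .first()                                                   // only the final k values reach the driver
}

since each partition contributes only k values, step 2 moves at most p * k values through the shuffle.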
i guess the worry is that in step 2 the one partition needs to hold p priority queues of size k in memory... the benefit is that the p priority queues do not get sent to the driver (which is not on the cluster).

On Sat, Jul 5, 2014 at 1:20 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i guess i could create a single priority queue per partition, then shuffle to
> a new rdd with 1 partition, and then reduce?
>
> On Sat, Jul 5, 2014 at 1:16 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> my initial approach to taking the top k values of an rdd was using a
>> priority-queue monoid, along these lines:
>>
>> rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) },
>> false).reduce(monoid.plus)
>>
>> this works fine, but looking at the code for reduce, it first reduces within
>> a partition (which doesn't help me) and then sends the results to the driver,
>> where they get reduced again. this means that for every partition the
>> (potentially very bulky) priority queue gets shipped to the driver.
>>
>> my driver is client side, not inside the cluster, and i cannot change this,
>> so shipping all these queues to the driver can be expensive.
>>
>> is there a better way to do this? should i try a shuffle first to reduce the
>> number of partitions to the minimum (since the number of queues shipped is
>> equal to the number of partitions)?
>>
>> is there a way to reduce to a single-item RDD, so the queues stay inside the
>> cluster and i can retrieve the final result with RDD.first?
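for completeness, the monoid approach from the quoted message, fleshed out the same way (again a sketch: TopKMonoid is a hand-rolled stand-in for whatever priority-queue monoid is actually in use, e.g. algebird's, and it assumes the scala queue serializes cleanly; spark itself keeps an internal serializable BoundedPriorityQueue around for top/takeOrdered for the same reason):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// hand-rolled stand-in for a priority-queue monoid: build makes a min-heap
// capped at k elements, plus merges two of them, keeping the k largest
class TopKMonoid(k: Int) extends Serializable {
  def build(items: Iterator[Int]): mutable.PriorityQueue[Int] = {
    val q = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
    items.foreach { x => q.enqueue(x); if (q.size > k) q.dequeue() }
    q
  }
  def plus(a: mutable.PriorityQueue[Int], b: mutable.PriorityQueue[Int]): mutable.PriorityQueue[Int] = {
    b.foreach { x => a.enqueue(x); if (a.size > k) a.dequeue() }
    a
  }
}

def topKViaReduce(rdd: RDD[Int], k: Int): mutable.PriorityQueue[Int] = {
  val monoid = new TopKMonoid(k)
  rdd
    .mapPartitions(items => Iterator.single(monoid.build(items)))
    .reduce(monoid.plus) // merges within each partition, then ships every
                         // partition's queue to the driver and merges there
}

this is the shape where every partition's (potentially bulky) queue travels to the client-side driver; the repartition(1) sketch above keeps that merge on the cluster and only ships the final k values.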