To make it efficient in your case you may need to do a bit of custom code to emit the top k per partition and then only send those to the driver. On the driver you can just top k the combined top k from each partition (assuming you have (object, count) for each top k list).
— Sent from Mailbox On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers <ko...@tresata.com> wrote: > my initial approach to taking top k values of a rdd was using a > priority-queue monoid. along these lines: > rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) }, > false).reduce(monoid.plus) > this works fine, but looking at the code for reduce it first reduces within > a partition (which doesnt help me) and then sends the results to the driver > where these again get reduced. this means that for every partition the > (potentially very bulky) priorityqueue gets shipped to the driver. > my driver is client side, not inside cluster, and i cannot change this, so > this shipping to driver of all these queues can be expensive. > is there a better way to do this? should i try to a shuffle first to reduce > the partitions to the minimal amount (since number of queues shipped is > equal to number of partitions)? > is was a way to reduce to a single item RDD, so the queues stay inside > cluster and i can retrieve the final result with RDD.first?