To make it efficient in your case you may need to do a bit of custom code to 
emit the top k per partition and then only send those to the driver. On the 
driver you can just top k the combined top k from each partition (assuming you 
have (object, count) for each top k list).

—
Sent from Mailbox

On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers <ko...@tresata.com> wrote:

> my initial approach to taking top k values of a rdd was using a
> priority-queue monoid. along these lines:
> rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) },
> false).reduce(monoid.plus)
> this works fine, but looking at the code for reduce it first reduces within
> a partition (which doesnt help me) and then sends the results to the driver
> where these again get reduced. this means that for every partition the
> (potentially very bulky) priorityqueue gets shipped to the driver.
> my driver is client side, not inside cluster, and i cannot change this, so
> this shipping to driver of all these queues can be expensive.
> is there a better way to do this? should i try to a shuffle first to reduce
> the partitions to the minimal amount (since number of queues shipped is
> equal to number of partitions)?
> is was a way to reduce to a single item RDD, so the queues stay inside
> cluster and i can retrieve the final result with RDD.first?

Reply via email to