You can do it in-memory as well: take the top 10% (top-K) elements from each
partition and merge the partial results with the merge step of any sorting
algorithm, like timsort. It is basically an aggregate per segment.

Your version uses a shuffle, but this version is zero-shuffle: assuming your
data set is cached, you will be doing an in-memory allReduce through
treeAggregate.
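
Roughly like this, as a sketch (untested; the segment/score types, the
two-pass structure, and the helper names are my assumptions, and the
per-partition buffers are kept naively with a sort instead of a bounded
priority queue for clarity):

import org.apache.spark.rdd.RDD

// Pass 1: per-segment row counts, combined through treeAggregate.
// No shuffle, just a tree reduction of small maps (assumes the number
// of segments is modest).
def segmentCounts(rdd: RDD[(String, Double)]): Map[String, Long] =
  rdd.treeAggregate(Map.empty[String, Long])(
    (m, kv) => m.updated(kv._1, m.getOrElse(kv._1, 0L) + 1L),
    (a, b) => b.foldLeft(a) { case (m, (seg, n)) =>
      m.updated(seg, m.getOrElse(seg, 0L) + n)
    })

// Pass 2: each partition keeps at most k(seg) = ceil(fraction * count(seg))
// of its highest scores per segment; partial maps are merged the same way.
// The union of per-partition top-k sets always contains the global top-k,
// so the driver ends up with the exact top fraction per segment.
def topFraction(rdd: RDD[(String, Double)],
                counts: Map[String, Long],
                fraction: Double = 0.10): Map[String, Vector[Double]] = {
  val k = counts.map { case (seg, n) => seg -> math.ceil(n * fraction).toInt }
  def keep(m: Map[String, Vector[Double]], seg: String, vs: Vector[Double]) =
    // sortBy goes through java.util.Arrays.sort, i.e. timsort
    m.updated(seg,
      (m.getOrElse(seg, Vector.empty) ++ vs).sortBy(-_).take(k(seg)))
  rdd.treeAggregate(Map.empty[String, Vector[Double]])(
    (m, kv) => keep(m, kv._1, Vector(kv._2)),
    (a, b) => b.foldLeft(a) { case (m, (seg, vs)) => keep(m, seg, vs) })
}

The minimum of each returned vector is the per-segment cut-off score, so
actually discarding the top 10% is then one more shuffle-free filter over
the cached RDD (modulo ties at the threshold).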

But this is only good for the top 10% or bottom 10%: the candidate set each
partition keeps grows with the fraction, so if you need the top 30%, the
shuffle version may work better.
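
For comparison, the shuffle version of the algorithm you describe below can
be a single groupByKey, then a sort and cut within each segment. A sketch,
assuming no single segment is too large to sort in one task's memory:

import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark
import org.apache.spark.rdd.RDD

// One shuffle to bring each segment to a single task, then sort and cut
// locally; fraction = 0.30 handles the top-30% case.
def dropTopFraction(rdd: RDD[(String, Double)],
                    fraction: Double): RDD[(String, Double)] =
  rdd.groupByKey().flatMap { case (seg, scores) =>
    val sorted = scores.toVector.sortBy(-_)            // descending
    val cut = math.ceil(sorted.size * fraction).toInt  // rows to discard
    sorted.drop(cut).map(s => (seg, s))
  }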

On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:

> Hi all,
>
> I have a distribution represented as an RDD of tuples, in rows of
> (segment, score).
> For each segment, I want to discard the tuples with the top X percent of
> scores. This seems hard to do with Spark RDDs.
>
> A naive algorithm would be -
>
> 1) Sort RDD by segment & score (descending)
> 2) Within each segment, number the rows from top to bottom.
> 3) For each segment, calculate the cut-off index, i.e. 10 rows to discard
> for a 10% cut-off in a segment with 100 rows.
> 4) For the entire RDD, keep only rows with row number > cut-off index.
>
> This does not look like a good algorithm. I would really appreciate it if
> someone could suggest a better way to implement this in Spark.
>
> Regards,
> Aung
>
