You can do it in memory as well: get the top 10% (top-K) elements from each partition and merge them, using the merge step from any sort algorithm like Timsort. It is basically aggregateBy.
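To make the idea concrete, here is a rough sketch in plain Python rather than Spark. Lists of (segment, score) tuples stand in for RDD partitions, and all function names are made up for illustration. The per-partition step and the merge step are roughly what you would pass as the two functions of an aggregate-style call. One assumption of mine: it first counts the rows per segment so that K is the same in every partition, which keeps the result exact even when one partition holds most of the top scores.

```python
import heapq
from collections import Counter
from functools import reduce


def top_k_in_partition(partition, k_by_segment):
    """Per-partition step: keep at most K highest scores per segment in a min-heap."""
    heaps = {}
    for segment, score in partition:
        k = k_by_segment[segment]
        if k == 0:
            continue  # nothing to cut from this segment
        heap = heaps.setdefault(segment, [])
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            # New score beats the smallest score kept so far.
            heapq.heapreplace(heap, score)
    return heaps


def merge_top_k(left, right, k_by_segment):
    """Merge step: combine two partial results, keeping the K highest per segment."""
    merged = {}
    for segment in left.keys() | right.keys():
        candidates = left.get(segment, []) + right.get(segment, [])
        merged[segment] = heapq.nlargest(k_by_segment[segment], candidates)
    return merged


def top_scores_by_segment(partitions, percent):
    """Return {segment: top scores, highest first} for the top `percent` of each segment."""
    # Assumption: one extra counting pass so K is global per segment.
    counts = sum((Counter(seg for seg, _ in p) for p in partitions), Counter())
    k_by_segment = {seg: n * percent // 100 for seg, n in counts.items()}
    partials = [top_k_in_partition(p, k_by_segment) for p in partitions]
    return reduce(lambda l, r: merge_top_k(l, r, k_by_segment), partials, {})


def discard_top(partitions, percent):
    """Drop rows whose score reaches the segment's top-`percent` cutoff."""
    top = top_scores_by_segment(partitions, percent)
    # The smallest score among the top K is the cutoff for that segment.
    cutoff = {seg: scores[-1] for seg, scores in top.items() if scores}
    return [
        (seg, score)
        for partition in partitions
        for seg, score in partition
        if seg not in cutoff or score < cutoff[seg]
    ]


if __name__ == "__main__":
    demo = [[("a", 10), ("a", 3), ("b", 50)], [("a", 9), ("a", 2), ("a", 8)]]
    print(discard_top(demo, 20))
```

Only K scores per segment ever leave a partition, which is why nothing needs to be shuffled. Note that rows tied with the cutoff score are all dropped in this sketch, so a segment with ties can lose slightly more than the requested percentage.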
Your version uses a shuffle, but this version needs no shuffle. Assuming your data set is cached, you will be doing an in-memory allReduce through treeAggregate. But this is only good for the top 10% or bottom 10%. If you need to do it for the top 30%, then maybe the shuffle version will work better.

On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:
> Hi all,
>
> I have a distribution represented as an RDD of tuples, in rows of
> (segment, score).
> For each segment, I want to discard tuples with top X percent scores. This
> seems hard to do in Spark RDD.
>
> A naive algorithm would be -
>
> 1) Sort RDD by segment & score (descending)
> 2) Within each segment, number the rows from top to bottom.
> 3) For each segment, calculate the cut off index. i.e. 90 for 10% cut off
> out of a segment with 100 rows.
> 4) For the entire RDD, filter rows with row num <= cut off index
>
> This does not look like a good algorithm. I would really appreciate it if
> someone can suggest a better way to implement this in Spark.
>
> Regards,
> Aung