Let me clarify: taking the top 10,000 of 50,000 elements using top or 
takeOrdered takes about 25-50 s, which seems slow. I also tried sortByKey to 
sort the elements as a time estimate, and I get numbers in the same range.
I'm running this application on a cluster with 5 nodes, using 3 cores each.

I'm interested in knowing what tuning can be done to get top/takeOrdered below 
25-50 s. I eventually want to scale up to 77 million elements, but right now I 
want to see how the performance of either top or takeOrdered could be improved 
for the smaller sample I'm using.

On Mar 11, 2015, at 7:24 PM, Imran Rashid <iras...@cloudera.com> wrote:

I am not entirely sure I understand your question -- are you saying:

* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is

?

if so, this shouldn't come as a huge surprise.  You can't find the top-scoring 
elements (no matter how small N is) unless you score all 77M of them.  Very 
naively, you would expect scoring 77M events to take ~1,500 times as long as 
scoring 50k events, right?  The fact that it doesn't take that much longer is 
probably because of the overhead of just launching the jobs.
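This also explains why the result size barely matters. A rough sketch of the 
bounded-heap selection that takeOrdered-style operations perform (plain Java, 
no Spark; Spark's actual per-partition implementation differs in detail): a 
max-heap of size N is kept while scanning, so the cost is O(total * log N), 
and the full scan dominates whether N is 10 or 10,000.

```java
import java.util.*;

public class BoundedTopN {
    // Return the n lowest scores, scanning all elements once with a
    // size-bounded max-heap (root = worst of the current best-n candidates).
    static List<Double> lowestN(double[] scores, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(n, Comparator.reverseOrder());
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s < heap.peek()) {
                // s beats the worst retained candidate; swap it in.
                heap.poll();
                heap.add(s);
            }
        }
        List<Double> best = new ArrayList<>(heap);
        Collections.sort(best); // ascending: best (lowest) score first
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.7, 0.1, 0.9, 0.0, 0.4};
        System.out.println(lowestN(scores, 2)); // prints [0.0, 0.1]
    }
}
```

The scan over `scores` touches every element regardless of n, which matches 
the observation below that returning 10 vs 10,000 elements takes the same time.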



On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish <ssehr...@fnal.gov> wrote:


From: Saba Sehrish <ssehr...@fnal.gov>
Date: March 9, 2015 at 4:11:07 PM CDT
To: <user-...@spark.apache.org>
Subject: Using top, takeOrdered, sortByKey

I am using Spark for a template-matching problem. We have 77 million events in 
the template library, and we compare the energy of each input event with each 
template event and return a score. In the end we return the best 10,000 
matches, i.e. those with the lowest scores; a score of 0 is a perfect match.

I downsampled the problem to use only 50k events. For matching a single event 
against all 50k template events, I see 150-200 ms for the score calculation on 
25 cores (using a YARN cluster), but after that, a top, takeOrdered, or even 
sortByKey takes 25-50 s.
So far I have not been able to figure out why there is such a huge gap between 
computing the list of scores and extracting the top N, and why sorting or 
selecting the best X matches dominates by such a large factor. One thing I have 
noticed is that the time stays in the same 25-50 s range no matter how many 
elements I return, anywhere from 10 to 10,000.

Any suggestions? Am I not using the API properly?

scores is a JavaPairRDD<Integer, Double>, numbestmatches is 10, 100, 10000, or 
any other number, and I do one of:

List<Tuple2<Integer, Double>> bestscores_list =
    scores.takeOrdered(numbestmatches, new TupleComparator());
// or
List<Tuple2<Integer, Double>> bestscores_list =
    scores.top(numbestmatches, new TupleComparator());
// or (sortByKey returns an RDD, not a List, so take the head explicitly)
List<Tuple2<Integer, Double>> bestscores_list =
    scores.sortByKey().take(numbestmatches);
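One thing worth double-checking in the snippet above is the comparator itself: 
Spark ships comparators to executors, so they must implement Serializable, and 
the direction of the ordering matters. takeOrdered(n, comp) returns the n 
smallest elements per the comparator, while top(n, comp) returns the n largest, 
so with the same ascending-by-score comparator they return opposite ends of the 
ranking. The TupleComparator class isn't shown in the original mail; the sketch 
below is a hypothetical version, using java.util.Map.Entry as a stand-in for 
scala.Tuple2 so it runs without Spark on the classpath.

```java
import java.io.Serializable;
import java.util.*;

public class TupleComparatorSketch {
    // Hypothetical comparator: must be Serializable so Spark can ship it to
    // executors. Map.Entry stands in for scala.Tuple2 to stay self-contained.
    static class ScoreComparator
            implements Comparator<Map.Entry<Integer, Double>>, Serializable {
        @Override
        public int compare(Map.Entry<Integer, Double> a, Map.Entry<Integer, Double> b) {
            // Lower score is a better match, so order ascending by score.
            return Double.compare(a.getValue(), b.getValue());
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> scores = new ArrayList<>();
        scores.add(new AbstractMap.SimpleEntry<>(1, 0.9));
        scores.add(new AbstractMap.SimpleEntry<>(2, 0.0));
        scores.add(new AbstractMap.SimpleEntry<>(3, 0.5));
        scores.sort(new ScoreComparator());
        System.out.println(scores.get(0).getKey()); // prints 2 (score 0.0, best match)
    }
}
```

With this ordering, takeOrdered would return the lowest (best) scores, which 
matches the stated goal; top with the same comparator would return the worst 
matches instead.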
