Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Saba Sehrish
Let me clarify - taking the top 1,000 elements of 50,000 elements using top or
takeOrdered takes about 25-50s, which seems slow. I also tried sortByKey to sort
the elements to get a time estimate, and I get numbers in the same range.
I'm running this application on a cluster with 5 nodes and using 3 cores each.

I'm interested in knowing what tuning can be done to get better performance than
25-50s for top/takeOrdered. I eventually want to scale up to 77 million elements,
but right now I want to see how the performance of either top or takeOrdered
could be improved for the smaller sample I'm using.

On Mar 11, 2015, at 7:24 PM, Imran Rashid <iras...@cloudera.com> wrote:

I am not entirely sure I understand your question -- are you saying:

* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is

?

if so, this shouldn't come as a huge surprise.  You can't find the top scoring 
elements (no matter how small N is) unless you score all 77M of them.  Very 
naively, you would expect scoring 77M events to take ~1000 times as long as 
scoring 50k events, right?  The fact that it doesn't take that much longer is 
probably b/c of the overhead of just launching the jobs.
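A quick way to see why N barely matters here: takeOrdered keeps a bounded heap of at most N elements per partition while it streams through all the scores, so the cost is roughly O(total * log N) and is dominated by the scan, not by N. The following is a minimal plain-Java sketch of that bounded-heap strategy (an illustration of the idea, not Spark's actual implementation and not the poster's code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class TopNSketch {
    // Streams through all scores once, keeping only the n smallest seen so far.
    // The heap never holds more than n elements, so total work is
    // O(total * log n) -- dominated by the scan, not by n.
    static List<Double> smallestN(double[] scores, int n) {
        // Max-heap on score: the root is the worst of the current best-n,
        // so it can be evicted as soon as a better score arrives.
        PriorityQueue<Double> heap = new PriorityQueue<>(n, Comparator.reverseOrder());
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s < heap.peek()) {
                heap.poll();
                heap.add(s);
            }
        }
        List<Double> best = new ArrayList<>(heap);
        best.sort(Comparator.naturalOrder());
        return best;
    }

    public static void main(String[] args) {
        // 50k synthetic scores, as in the downsampled problem.
        double[] scores = new Random(42).doubles(50_000).toArray();
        List<Double> best = smallestN(scores, 1000);
        System.out.println(best.size()); // 1000
    }
}
```

Whether N is 10 or 1,000, the loop still visits every score, which matches the observation that the returned count makes little difference.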



On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish <ssehr...@fnal.gov> wrote:


From: Saba Sehrish <ssehr...@fnal.gov>
Date: March 9, 2015 at 4:11:07 PM CDT
To: <user-...@spark.apache.org>
Subject: Using top, takeOrdered, sortByKey

I am using Spark for a template matching problem. We have 77 million events in
the template library, and we compare the energy of each input event with that of
each template event and return a score. In the end we return the best 1,000
matches with the lowest scores. A score of 0 is a perfect match.

I downsampled the problem to use only 50k events. For a single event matched
against all the events in the template (50k), I see 150-200ms for the score
calculation on 25 cores (using a YARN cluster), but after that, when I perform
either top or takeOrdered or even sortByKey, the time reaches 25-50s.
So far I have not been able to figure out why there is such a huge gap going
from a list of scores to a list of the top 1,000 scores, and why sorting or
getting the best X matches dominates by such a large factor. One thing I have
noticed is that it doesn't matter how many elements I return: the time is in the
same 25-50s range for 10 - 1,000 elements.

Any suggestions? Am I not using the API properly?

scores is a JavaPairRDD<Double, Integer> (score, event index), and I do
something like the following, where numbestmatches is 10, 100, 1,000 or any
number:

List<Tuple2<Double, Integer>> bestscores_list =
scores.takeOrdered(numbestmatches, new TupleComparator());
Or
List<Tuple2<Double, Integer>> bestscores_list =
scores.top(numbestmatches, new TupleComparator());
Or
JavaPairRDD<Double, Integer> sorted_scores = scores.sortByKey();
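Two details in the snippets above are worth checking (a hedged note, since the definition of TupleComparator is not shown in the thread). First, a comparator passed to top or takeOrdered is serialized and shipped to the executors, so it must implement java.io.Serializable in addition to Comparator, or the job typically fails with a "task not serializable" error. Second, with the same comparator, takeOrdered returns the N smallest elements while top returns the N largest, so for "best match = lowest score" only takeOrdered gives the intended result. A sketch of such a comparator, using Map.Entry as a JDK-only stand-in for Spark's scala.Tuple2:

```java
import java.io.Serializable;
import java.util.Comparator;
import java.util.Map;

// Hypothetical reconstruction of a TupleComparator: orders (score, eventIndex)
// pairs by ascending score. In Spark the element type would be
// scala.Tuple2<Double, Integer>; Map.Entry stands in so the example runs with
// only the JDK. It implements Serializable because top/takeOrdered send the
// comparator to every executor.
public class ScoreComparator
        implements Comparator<Map.Entry<Double, Integer>>, Serializable {
    @Override
    public int compare(Map.Entry<Double, Integer> a, Map.Entry<Double, Integer> b) {
        return Double.compare(a.getKey(), b.getKey()); // ascending; 0 is a perfect match
    }
}
```

With a comparator like this, scores.takeOrdered(numbestmatches, new TupleComparator()) returns the lowest-scoring pairs directly; sortByKey by contrast shuffles and fully sorts every pair just to read back the first N, so it is the least attractive of the three calls for this use case.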



