sortByKey is indeed O(n log n). It makes a first pass to figure out 
even-sized partitions (by sampling the RDD), then a second pass to do a 
distributed merge-sort: first range-partition the data across machines, then 
run a reduce phase that sorts the data within each partition. The point where 
it becomes worthwhile to scale out versus sorting on a single machine is 
probably pretty high, because communication over a network is *much* slower 
than memory bandwidth within a machine. Generally it makes the most sense for 
data that doesn't fit in memory on a single machine, or data that already 
starts out distributed.
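
To make that concrete, here is a minimal sketch of calling sortByKey, with 
the two passes noted in comments (the "local[4]" master, the sample data, 
and the partition count are just illustrative assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative setup; on a real cluster the master URL would differ.
    val conf = new SparkConf().setAppName("SortByKeyExample").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq((42, "b"), (7, "a"), (99, "c"), (15, "d")))

    // Pass 1 (done internally): sample the keys to choose range boundaries
    // that give roughly even-sized partitions (a RangePartitioner).
    // Pass 2: shuffle each record to the partition owning its key range,
    // then sort each partition locally; reading the partitions in order
    // yields a globally sorted RDD.
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 2)

    sorted.collect().foreach(println)  // (7,a), (15,d), (42,b), (99,c)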

Please also note that if you run Spark on just one multicore machine, it still 
goes through many of the same code paths as on a cluster (e.g. serializing data 
between tasks) -- it's not optimized to be as fast as, say, a multithreaded 
sort framework. So it wouldn't make a ton of sense to use it for that.
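
As a rough sketch of that comparison (reusing the sc from the example above; 
the array size is arbitrary):

    // Single-machine baseline: a plain Scala sort, with no partitioning or
    // serialization between tasks.
    val xs = Array.fill(1000000)(scala.util.Random.nextInt())
    val sortedLocally = xs.sorted

    // The same sort through Spark in local mode still pays for the shuffle
    // (serializing data between tasks), so for data that fits in memory it
    // will usually lose to xs.sorted.
    val sortedViaSpark = sc.parallelize(xs).sortBy(identity).collect()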

Matei

On September 15, 2014 at 10:32:14 PM, cjwang (c...@cjwang.us) wrote:

I wonder what algorithm is used to implement sortByKey? I assume it is some 
O(n log n) algorithm parallelized across some number of nodes, right? 

Then, what size of data would make it worthwhile to use sortByKey on 
multiple processors rather than standard Scala sort functions on a single 
processor (considering the overhead of putting stuff into RDDs and 
collecting them back)? 


