Hi, I've seen a few cases where a reduce operation is executed sequentially rather than in parallel.
For example, I have the following code that does a simple word count over very big data, using per-partition hashmaps instead of (word, 1) pairs, which would overflow memory at shuffle time:

    rdd.mapPartitions { iter =>
      val counts = new java.util.HashMap[Int, Int]()
      // ... fill counts with (word, count) values from iter ...
      Iterator(counts)
    }.reduce { (h1, h2) =>
      val it = h2.keySet().iterator()
      while (it.hasNext) {
        val key = it.next()
        if (h1.containsKey(key)) {
          h1.put(key, h1.get(key) + h2.get(key))
        } else {
          h1.put(key, h2.get(key))
        }
      }
      h2.clear()
      h1
    }

After all the mappers are done, the process becomes single-threaded and each reduce step runs sequentially. This is very time-inefficient, and I don't understand why the reduce operation is not executed in parallel as I expected.

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reducer-doesn-t-operate-in-parallel-tp26389.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
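P.S. In case it helps to look at the merge step in isolation: below is a minimal standalone sketch of the same merge logic in plain Java (the Scala code above uses java.util.HashMap, so the method calls are identical). The class name MergeSketch and the sample counts are made up for illustration; only the merge body mirrors the reduce function.

```java
import java.util.HashMap;

public class MergeSketch {
    // Merge h2 into h1 by summing counts, mirroring the reduce function:
    // existing keys get their counts added, new keys are copied over.
    static HashMap<Integer, Integer> merge(HashMap<Integer, Integer> h1,
                                           HashMap<Integer, Integer> h2) {
        for (Integer key : h2.keySet()) {
            if (h1.containsKey(key)) {
                h1.put(key, h1.get(key) + h2.get(key));
            } else {
                h1.put(key, h2.get(key));
            }
        }
        h2.clear(); // free the merged-away map, as in the Spark snippet
        return h1;
    }

    public static void main(String[] args) {
        // Two fabricated per-partition count maps.
        HashMap<Integer, Integer> a = new HashMap<>();
        a.put(1, 2);
        a.put(2, 1);
        HashMap<Integer, Integer> b = new HashMap<>();
        b.put(1, 3);
        b.put(3, 4);

        HashMap<Integer, Integer> m = merge(a, b);
        System.out.println(m.get(1) + " " + m.get(2) + " " + m.get(3)); // prints "5 1 4"
    }
}
```

This merge function is commutative and associative, so nothing in the function itself forces the merges to happen one at a time on a single thread.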