So if I do something like this, does Spark handle the parallelization and recombination of sum and count on the cluster automatically? I started looking through the source and saw that foreach does submit a job to the cluster, but it looked like the inner function needed to return something for it to work properly.
  val grouped = rdd.groupByKey()
  grouped.foreach { x =>
    val iterable = x._2
    var sum = 0.0
    var count = 0
    iterable.foreach { y =>
      sum = sum + y.foo
      count = count + 1
    }
    val mean = sum / count
    // save mean to database...
  }
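For reference, here is a sketch of the same per-key sum/count written without groupByKey (the names Record and foo are just illustrative, mirroring the example above, and rdd is assumed to be an RDD[(String, Record)]). With reduceByKey, Spark combines the partial (sum, count) pairs within each partition before shuffling and then merges them across the cluster, which is the kind of parallel recombination being asked about:

  // illustrative element type; assumes a numeric field foo as in the example above
  case class Record(foo: Double)

  val means = rdd
    // turn each value into a (sum, count) pair
    .mapValues(rec => (rec.foo, 1L))
    // merge pairs per partition, then across partitions
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    // divide to get the per-key mean
    .mapValues { case (sum, count) => sum / count }

Because the partial sums and counts are combined map-side, this avoids materializing each key's full Iterable the way groupByKey does.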