Here's a more functional-programming-friendly take on the computation (though, yes, this is the numerically naive formula for the standard deviation):
rdd.groupByKey.mapValues { mcs =>
  val values = mcs.map(_.foo.toDouble)
  val n = values.size
  val sum = values.sum
  val sumSquares = values.map(x => x * x).sum
  math.sqrt(n * sumSquares - sum * sum) / n
}

(Note: it's values.size, not values.count -- count on an Iterable expects a predicate.)

This gives you an RDD of (key, stdev) pairs. I think you want to compute this RDD and *then* do something to save it if you like. Sure, that could be collecting it locally and saving to a DB. Or you could use foreach to do something remotely for every key-value pair. More efficient would be mapPartitions, which lets you do something to a whole partition of key-value pairs at a time.

On Fri, Aug 1, 2014 at 9:56 PM, kriskalish <k...@kalish.net> wrote:
> So if I do something like this, Spark handles the parallelization and
> recombination of sum and count on the cluster automatically? I started
> peeking into the source and see that foreach does submit a job to the
> cluster, but it looked like the inner function needed to return something
> to work properly.
>
> val grouped = rdd.groupByKey()
> grouped.foreach { x =>
>   val iterable = x._2
>   var sum = 0.0
>   var count = 0
>   iterable.foreach { y =>
>     sum = sum + y.foo
>     count = count + 1
>   }
>   val mean = sum / count
>   // save mean to database...
> }
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192p11207.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
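For what it's worth, here is a plain-Scala sketch (no cluster needed) of the same per-key computation, restructured around a (count, sum, sumSquares) accumulator. Splitting it into a per-value merge and an accumulator-to-accumulator merge is roughly the shape you'd hand to Spark's combineByKey/aggregateByKey to avoid groupByKey materializing every group in memory. The object and helper names and the sample data are my own for illustration, not Spark API:

```scala
object StdevByKey {
  // Fold one value into a (count, sum, sumSquares) accumulator
  // (what Spark would run within a partition).
  def seqOp(acc: (Long, Double, Double), x: Double): (Long, Double, Double) =
    (acc._1 + 1, acc._2 + x, acc._3 + x * x)

  // Merge two accumulators (what Spark would run across partitions).
  def combOp(a: (Long, Double, Double), b: (Long, Double, Double)): (Long, Double, Double) =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  // Same naive population-stdev formula as in the RDD snippet above.
  def stdev(acc: (Long, Double, Double)): Double = {
    val (n, sum, sumSq) = acc
    math.sqrt(n * sumSq - sum * sum) / n
  }

  def main(args: Array[String]): Unit = {
    // Illustrative stand-in for the (key, value) pairs in the RDD.
    val data = Seq(("a", 1.0), ("a", 2.0), ("a", 3.0), ("a", 4.0), ("b", 5.0))
    val byKey = data.groupBy(_._1).map { case (k, pairs) =>
      k -> stdev(pairs.map(_._2).foldLeft((0L, 0.0, 0.0))(seqOp))
    }
    byKey.foreach { case (k, s) => println(s"$k -> $s") }
  }
}
```

Because the accumulators are small and associative, nothing per-key ever has to fit on one machine except the three running totals.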