Here's a more functional-programming-friendly take on the computation
(but yeah, this is still the naive formula, which can lose precision
when the values are large relative to their spread):

rdd.groupByKey.mapValues { mcs =>
  val values = mcs.map(_.foo.toDouble)
  val n = values.size   // count needs a predicate; size is what you want here
  val sum = values.sum
  val sumSquares = values.map(x => x * x).sum
  // population stddev: sqrt(n * sum(x^2) - sum(x)^2) / n
  math.sqrt(n * sumSquares - sum * sum) / n
}
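
By the way, groupByKey has to materialize each group in memory. You can
get the same numbers without that by reducing (count, sum, sum of
squares) triples per key. A rough sketch, assuming the same hypothetical
.foo field as above:

val stats = rdd
  .mapValues(mc => (1L, mc.foo.toDouble, mc.foo.toDouble * mc.foo.toDouble))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))  // merge partial sums
  .mapValues { case (n, sum, sumSquares) =>
    math.sqrt(n * sumSquares - sum * sum) / n  // same naive formula
  }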

This gives you an RDD of (key, stdev) pairs. I think you want to compute
this RDD and *then* do something to save it if you like. Sure, that
could be collecting it locally and saving to a DB. Or you could use
foreach to do something remotely for every key-value pair. More
efficient is foreachPartition, which lets you handle a whole partition
of key-value pairs at a time (for example, one DB connection per
partition instead of one per record).
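
A rough sketch of that last approach, using the stats RDD from the
sketch above; the JDBC URL and table name are made up, so substitute
your own:

stats.foreachPartition { pairs =>
  // One connection per partition, reused for every record in it.
  val conn = java.sql.DriverManager.getConnection("jdbc:postgresql://host/db")  // hypothetical URL
  try {
    val stmt = conn.prepareStatement("INSERT INTO key_stdev (key, stdev) VALUES (?, ?)")  // hypothetical table
    pairs.foreach { case (key, stdev) =>
      stmt.setString(1, key.toString)
      stmt.setDouble(2, stdev)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}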


On Fri, Aug 1, 2014 at 9:56 PM, kriskalish <k...@kalish.net> wrote:
> So if I do something like this, spark handles the parallelization and
> recombination of sum and count on the cluster automatically? I started
> peeking into the source and see that foreach does submit a job to the
> cluster, but it looked like the inner function needed to return something to
> work properly.
>
> val grouped = rdd.groupByKey()
> grouped.foreach{ x =>
>     val iterable = x._2
>     var sum = 0.0
>     var count = 0
>     iterable.foreach{ y =>
>         sum = sum + y.foo
>         count = count + 1
>     }
>     val mean = sum/count;
>     // save mean to database...
> }
>
