So if I do something like this, Spark handles the parallelization and
recombination of sum and count on the cluster automatically? I started
digging into the source and saw that foreach does submit a job to the
cluster, but it looked like the function passed to it needed to return
something to work properly.

val grouped = rdd.groupByKey()
grouped.foreach { case (key, values) =>
    var sum = 0.0
    var count = 0
    values.foreach { v =>
        sum += v.foo
        count += 1
    }
    val mean = sum / count
    // save mean to database...
}
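
For what it's worth, here is a minimal, untested sketch of the same
per-key computation done with reduceByKey instead of groupByKey, so no
per-key Iterable has to be materialized on a single executor. It assumes
rdd is an RDD[(K, Record)], where Record is a hypothetical type with a
Double field foo as in the snippet above, and it also produces the
standard deviation from the thread's subject line:

val stats = rdd
  .map { case (k, rec) => (k, (1L, rec.foo, rec.foo * rec.foo)) } // (count, sum, sum of squares)
  .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) =>
    (n1 + n2, s1 + s2, q1 + q2)                                   // combine partial aggregates
  }
  .mapValues { case (n, sum, sumSq) =>
    val mean = sum / n
    // population std dev; max(0, _) guards against tiny negative float error
    val stdDev = math.sqrt(math.max(0.0, sumSq / n - mean * mean))
    (mean, stdDev)
  }
// stats: RDD[(K, (Double, Double))] of (mean, stdDev) per key

On the return-value question: the function passed to foreach returns
Unit, and Spark runs it on the executors purely for its side effects,
which is why saving to the database from inside it works. foreachPartition
is often preferable there, since it lets you open one connection per
partition rather than per element.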




