I have what seems like a relatively straightforward task to accomplish, but I
cannot seem to figure it out from the Spark documentation or searching the
mailing list.

I have an RDD[(String, MyClass)] that I would like to group by key and then
calculate the mean and standard deviation of the "foo" field of MyClass for
each group. It "feels" like groupByKey should give me an RDD of values for
each unique key, but it gives me an Iterable instead.

As in:

val grouped = rdd.groupByKey()

grouped.foreach { case (key, values) =>
   // values is an Iterable[MyClass], not an RDD, so the RDD[Double]
   // methods mean() and stdev() are not available here
   val mean = values.map(_.foo).mean()
   val dev = values.map(_.foo).stdev()
   // do fancy things with the mean and deviation
}

However, there seems to be no way to convert the Iterable back into an RDD.
Is there some other technique for doing this? I'm at the point where I'm
considering copying and pasting Spark's StatCounter class and changing the
type from Double to MyClass (or making it generic).
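
For concreteness, here's roughly the kind of thing I'm hoping is possible
without the copy-and-paste: feeding the foo values into a StatCounter per key
via combineByKey. This is an untested sketch, and it assumes MyClass.foo is a
Double:

import org.apache.spark.util.StatCounter

val stats = rdd
  .mapValues(_.foo)
  .combineByKey(
    (x: Double) => new StatCounter().merge(x),          // start a StatCounter for a new key
    (s: StatCounter, x: Double) => s.merge(x),          // fold another value into a partial result
    (s1: StatCounter, s2: StatCounter) => s1.merge(s2)  // combine partial results across partitions
  )

stats.foreach { case (key, s) =>
  // do fancy things with s.mean and s.stdev
}

But maybe there's a more direct way that I'm missing.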

Am I going down the wrong path?


