I meant I'm not sure how to do variance in one shot :-) With the mean in hand, you can obviously broadcast it and do another map/reduce to calculate the variance per key.
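That said, a single pass is possible if you also accumulate the sum of squares, since variance = E[x^2] - (E[x])^2. A rough sketch (untested, and assuming foo is a Double):

// one pass: accumulate (sum, sum of squares, count) per key
val stats = rdd
  .map { case (k, v) => (k, (v.foo, v.foo * v.foo, 1L)) }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
  .mapValues { case (sum, sumSq, n) =>
    val mean = sum / n
    val variance = sumSq / n - mean * mean  // population variance
    (mean, math.sqrt(variance))            // (mean, stddev)
  }

One caveat: the sumSq/n - mean^2 form can lose precision when the variance is small relative to the mean, so the two-pass approach may be safer for badly scaled data.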
On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x,y) =>
> (x._1+y._1, x._2+y._2)).collect
>
> This gives you a list of (key, (tot, count)), from which you can easily
> calculate the mean. Not sure about variance.
>
>
> On Fri, Aug 1, 2014 at 2:55 PM, kriskalish <k...@kalish.net> wrote:
>
>> I have what seems like a relatively straightforward task to accomplish,
>> but I cannot seem to figure it out from the Spark documentation or by
>> searching the mailing list.
>>
>> I have an RDD[(String, MyClass)] that I would like to group by key and
>> then calculate the mean and standard deviation of the "foo" field of
>> MyClass. It "feels" like I should be able to use groupByKey to get an
>> RDD for each unique key, but it gives me an Iterable.
>>
>> As in:
>>
>> val grouped = rdd.groupByKey()
>>
>> grouped.foreach { g =>
>>   val mean = g.map(x => x.foo).mean()
>>   val dev = g.map(x => x.foo).stddev()
>>   // do fancy things with the mean and deviation
>> }
>>
>> However, there seems to be no way to convert the Iterable into an RDD.
>> Is there some other technique for doing this? I'm to the point where I'm
>> considering copying and pasting the StatCounter class and changing the
>> type from Double to MyClass (or making it generic).
>>
>> Am I going down the wrong path?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
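PS: on the StatCounter idea in the quoted message: Spark already ships org.apache.spark.util.StatCounter for doubles, and you can fold one per key with combineByKey instead of copying the class. Again a rough, untested sketch assuming foo is a Double:

import org.apache.spark.SparkContext._  // pair-RDD functions (Spark 1.x)
import org.apache.spark.util.StatCounter

val statsByKey = rdd
  .mapValues(_.foo)
  .combineByKey(
    (v: Double) => StatCounter(v),                      // new combiner per key
    (s: StatCounter, v: Double) => s.merge(v),          // fold in a value
    (s1: StatCounter, s2: StatCounter) => s1.merge(s2)  // merge partitions
  )
  .mapValues(s => (s.mean, s.stdev))

This keeps everything distributed, so there is no need to turn each group's Iterable into an RDD.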