I meant I'm not sure how to do variance in one shot :-)

With the means in hand, you can broadcast them and do another
map/reduce to calculate the variance per key.
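For example, a rough sketch of that two-pass approach (assuming an
RDD[(String, Double)]; the names rdd, means, and meansBc are mine, not
from the thread):

val sumCount = rdd.mapValues(v => (v, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
val means = sumCount.mapValues { case (sum, n) => sum / n }

// Broadcast the (presumably small) map of per-key means to the executors.
val meansBc = rdd.sparkContext.broadcast(means.collectAsMap())

// Second pass: per-key average of squared deviations from the mean.
val variances = rdd.map { case (k, v) =>
  val d = v - meansBc.value(k)
  (k, (d * d, 1L))
}.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
 .mapValues { case (sqSum, n) => sqSum / n }

That gives the population variance; math.sqrt of it is the standard
deviation.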


On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) =>
> (x._1 + y._1, x._2 + y._2)).collect
>
> This gives you a list of (key, (tot, count)) pairs, from which you can
> easily calculate the mean. Not sure about variance.
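>
> For example, to turn that into per-key means (a minimal sketch, assuming
> foo is a Double, so tot is a Double and count is an Int):
>
> val means = res.map { case (k, (tot, count)) => (k, tot / count) }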
>
>
> On Fri, Aug 1, 2014 at 2:55 PM, kriskalish <k...@kalish.net> wrote:
>
>> I have what seems like a relatively straightforward task to accomplish,
>> but I
>> cannot seem to figure it out from the Spark documentation or searching the
>> mailing list.
>>
>> I have an RDD[(String, MyClass)] that I would like to group by the key,
>> and calculate the mean and standard deviation of the "foo" field of
>> MyClass. It "feels" like I should be able to use groupByKey to get an
>> RDD for each unique key, but it gives me an Iterable.
>>
>> As in:
>>
>> val grouped = rdd.groupByKey()
>>
>> grouped.foreach { case (key, values) =>
>>    // values is an Iterable[MyClass], not an RDD, so the RDD methods
>>    // mean() and stddev() are not available on it.
>>    val mean = values.map(_.foo).mean()
>>    val dev = values.map(_.foo).stddev()
>>    // do fancy things with the mean and deviation
>> }
>>
>> However, there seems to be no way to convert the Iterable into an RDD.
>> Is there some other technique for doing this? I'm at the point where
>> I'm considering copying and pasting the StatCollector class and
>> changing the type from Double to MyClass (or making it generic).
>>
>> Am I going down the wrong path?
>>
>
>
