Re: about aggregateByKey and standard deviation

2014-11-03 Thread Kamal Banga
I don't think this can be done directly with .aggregateByKey(), because we
also need the number of values per key (for the average). Maybe we can use
.countByKey(), which returns a map of counts per key, together with
.foldByKey(0)(_+_) (or aggregateByKey()), which gives the sum of values per
key. I'm not sure how to proceed from there myself.
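
One possible way forward, as an untested sketch rather than a verified Spark job: the count does not have to come from a separate pass if it is carried inside the accumulator. Below, the accumulator is a (count, sum, sum of squares) triple, which is enough to recover both the mean and the population standard deviation. The names KeyStats, Stats, zero, seqOp, combOp and finish are made up for illustration, and the values are assumed to be Doubles. The sum-of-squares formula can lose precision when the values are large and close together, so treat it as a starting point only.

```scala
object KeyStats {
  // Hypothetical accumulator: (count, sum, sum of squares) for one key.
  type Stats = (Long, Double, Double)

  val zero: Stats = (0L, 0.0, 0.0)

  // seqOp: fold one value into the accumulator for its key.
  def seqOp(acc: Stats, v: Double): Stats =
    (acc._1 + 1, acc._2 + v, acc._3 + v * v)

  // combOp: merge two partial accumulators, e.g. from different partitions.
  def combOp(a: Stats, b: Stats): Stats =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  // Turn an accumulator into (mean, population standard deviation).
  def finish(s: Stats): (Double, Double) = {
    val (n, sum, sumSq) = s
    val mean = sum / n
    // Clamp at zero in case rounding makes the variance slightly negative.
    val variance = math.max(0.0, sumSq / n - mean * mean)
    (mean, math.sqrt(variance))
  }

  // With Spark, assuming rdd: RDD[(String, Double)], this would be wired up as:
  //   rdd.aggregateByKey(zero)(seqOp, combOp).mapValues(finish)

  // Local check on plain collections, mimicking two partitions of one key.
  val part1: Stats = Seq(2.0, 4.0, 4.0, 4.0).foldLeft(zero)(seqOp)
  val part2: Stats = Seq(5.0, 5.0, 7.0, 9.0).foldLeft(zero)(seqOp)
  val (mean, std) = finish(combOp(part1, part2))
}
```

Since only three numbers per key are shuffled instead of every value, this should avoid most of the cost of groupByKey.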

Regards

On Fri, Oct 31, 2014 at 1:26 PM, qinwei wei@dewmobile.net wrote:

 Hi, everyone
 I have an RDD filled with data like
 (k1, v11)
 (k1, v12)
 (k1, v13)
 (k2, v21)
 (k2, v22)
 (k2, v23)
 ...

 I want to calculate the average and standard deviation of (v11, v12,
 v13) and (v21, v22, v23), grouped by their keys.
 For the moment, I have done that using groupByKey and map. I notice
 that groupByKey is very expensive, but I cannot figure out how to do it
 with aggregateByKey, so I wonder: is there a better way to do this?

 Thanks!

 --
 qinwei



about aggregateByKey and standard deviation

2014-10-31 Thread qinwei






Hi, everyone

I have an RDD filled with data like

    (k1, v11)
    (k1, v12)
    (k1, v13)
    (k2, v21)
    (k2, v22)
    (k2, v23)
    ...

I want to calculate the average and standard deviation of (v11, v12, v13)
and (v21, v22, v23), grouped by their keys.
For the moment, I have done that using groupByKey and map. I notice that
groupByKey is very expensive, but I cannot figure out how to do it with
aggregateByKey, so I wonder: is there a better way to do this?

Thanks!


qinwei