Hello all, 

I’m a beginner in Spark and Scala. I have the following code, which does a
groupBy on two keys.

val rdd2 = rdd1.groupBy(x => (x._2._1._1, x._2._1._2)) 

rdd1 is the result of a left outer join between two RDDs; its type and a few
sample rows are shown below.

Class[_ <: org.apache.spark.rdd.RDD[ ((Int, Int), ((Int, String, Float, Int,
String, Int, Int, Int, Int, Int, Float, Float, Float, Float, Float),
Option[Int])) ]] = class org.apache.spark.rdd.FlatMappedValuesRDD

((447757,-312112),((101,621DA3,15.76,182,2014-08-02
13:51:35.8790000,6,1351,1,1,1,15.974019,1.0,0.066,-0.256,2.075964),Some(4)))
((447732,-323219),((102,721DA3,9.263333,187,2014-08-10
13:51:48.8790000,6,1351,1,1,1,16.147541,2.0,-1.288333,0.120833,-54.37624),None))
((447758,-312112),((101,621DA3,15.694,182,2014-08-02
13:51:34.8790000,6,1351,1,1,1,18.05863,1.0,0.322,0.008,10.003252),Some(1)))
((447763,-312113),((101,621DA3,13.98,189,2014-08-02
13:51:29.8790000,6,1351,1,1,1,14.02744,1.0,0.052,-0.492,1.451216),Some(2)))
((447751,-312110),((102,721DA3,14.318,200,2014-08-10
13:51:41.8790000,6,1351,1,1,1,14.419073,1.0,-0.578,-0.198,-16.885693),Some(1)))

The groupBy keys are the contents of x._2._1._1 and x._2._1._2, i.e. (101 or
102) and (521DA3, 621DA3, or 721DA3).
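
(Side note: I find the chained ._2._1._1 accessors hard to read; I believe the
same groupBy can also be written with a case pattern, though I haven't tested it:)

// Should be equivalent to rdd2 above: destructure the join result and group
// on the first two fields of the 15-tuple (the 101/102 id and the xxxDA3 code).
val rdd2Alt = rdd1.groupBy { case (_, (record, _)) => (record._1, record._2) }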


rdd2 looks like this:  

( (101,521DA3), 
ArrayBuffer(((447770,-312487),((101,521DA3,3.52,195,2014-08-10
13:51:06.8790000,6,1351,1,1,1,0.0,0.0,0.0,0.0,0.0),Some(4))),
((447769,-312489),((101,521DA3,4.89,203,2014-08-10
13:51:09.8790000,6,1351,1,1,1,13.48127,3.0,0.456667,0.152222,11.521707),Some(4))),
((447767,-312111),((101,521DA3,11.1675,193,2014-08-10
13:51:24.8790000,6,1351,1,1,1,19.49997,2.0,1.550417,0.332292,59.641956),Some(4))),
((447768,-312489),((101,521DA3,8.066666,201,2014-08-10
13:51:22.8790000,6,1351,1,1,1,13.649966,2.0,0.885833,0.186354,25.444075),Some(4))),
((447769,-312488),((101,521DA3,2.19,245,2014-08-10
13:51:12.8790000,6,1351,1,1,1,11.013295,3.0,-0.9,-0.452222,-19.116),Some(1))),
((447766,-312112),((101,521DA3,12.6175,192,2014-08-10
13:51:25.8790000,6,1351,1,1,1,11.341975,1.0,1.45,-0.100417,34.48825),Some(4))),
((447768,-312110),((101,521DA3,6.295,219,2014-08-10
13:51:20.8790000,6,1351,1,1,1,12.467666,8.0,0.513125,0.176641,34.830925),Some(2)))))

I want to perform operations such as standard deviation (stdev) and sum on the
contents of each ArrayBuffer, depending on certain conditions (e.g. whether
Some(4) or Some(2) is present).
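
What I have in mind is roughly the following (untested sketch; I'm assuming the
Option[Int] from the outer join, e.g. Some(4) or Some(2), is the flag to filter
on, and that the Float I care about is the 3rd field of the 15-tuple):

// For each group, keep only the rows whose join flag is Some(4) or Some(2),
// then pull out the Float field to aggregate later.
val filtered = rdd2.mapValues { rows =>
  rows.collect { case (_, (record, Some(flag))) if flag == 4 || flag == 2 =>
    record._3  // the Float field, e.g. 15.76
  }
}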

I’m able to access the elements like this: 

val rdd3 = rdd2.map(group => group._2.map(_._2._1._3))  // extract the Float field
rdd3.foreach(println)

rdd3: org.apache.spark.rdd.RDD[Iterable[Float]]

ArrayBuffer(3.52, 4.89, 11.1675, 8.066666, 2.19, 12.6175, 6.295)
ArrayBuffer(9.263333, 14.318, 11.84, 15.508, 15.82, 7.35, 13.37, 13.478,
13.802, 14.896, 14.126, 15.276)
ArrayBuffer(15.76, 15.694, 13.98, 15.058, 14.38, 14.67, 13.928, 15.372,
12.9, 13.384, 16.024)
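
(As an aside, I'm wondering whether mapValues would keep the group keys here;
untested sketch:)

// mapValues preserves the (Int, String) key of each group, so this should
// give RDD[((Int, String), Iterable[Float])] instead of RDD[Iterable[Float]].
val rdd3Keyed = rdd2.mapValues(_.map(_._2._1._3))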

val v1 = rdd3.take(1)

v1: Array[Iterable[Float]] = Array(ArrayBuffer(3.52, 4.89, 11.1675, 8.066666, 2.19, 12.6175, 6.295))

Now I need to find the sum and stdev of the numbers in ArrayBuffer(3.52, 4.89,
11.1675, 8.066666, 2.19, 12.6175, 6.295).
My question is: how can I access the contents of the ArrayBuffer?
Or is there a way to create an RDD out of the contents of the ArrayBuffer, so
that I can easily perform the operations (sum, stdev, etc.)?
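
For what it's worth, here is the kind of thing I was hoping would work
(untested sketch; from the API docs it looks like an RDD[Double] picks up
sum() and stdev() through the DoubleRDDFunctions implicit, and the per-group
version below just computes the statistics by hand inside each group):

import org.apache.spark.SparkContext._  // brings in the RDD[Double] implicits

// Option 1: flatten all groups into one RDD of numbers.
val flat = rdd3.flatMap(identity).map(_.toDouble)
println(flat.sum())    // sum over all values across groups
println(flat.stdev())  // population standard deviation across groups

// Option 2: keep the groups and compute (sum, stdev) per key, locally.
val perGroupStats = rdd2.mapValues { rows =>
  val xs    = rows.map(_._2._1._3.toDouble)
  val n     = xs.size
  val mean  = xs.sum / n
  val stdev = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / n)
  (xs.sum, stdev)
}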

Any help regarding this would be extremely helpful. 
Thanks a lot for your time,
Krishna 


