Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up
how many times each column is populated in an RDD. Here is the code I
came up with:
// RDD of List[String] corresponding to tab-delimited values.
// Note: split with limit -1 keeps trailing empty fields, so every
// row has the same number of columns (plain split drops them, which
// would make zipped silently truncate short rows).
val columns = spark.textFile("myfile.tsv").map(line =>
  line.split("\t", -1).toList)

// RDD of List[Int] marking populated columns (1 for populated,
// 0 for not populated)
val populatedColumns = columns.map(row =>
  row.map(column => if (column.length > 0) 1 else 0))

// List[Int] containing the sum of the 1's in each column
val counts = populatedColumns.reduce((row1, row2) =>
  (row1, row2).zipped.map(_ + _))
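For anyone who wants to poke at it, here is the same logic on plain Scala collections with some made-up sample rows (no cluster needed), which is how I convinced myself the zipped reduce does what I think it does:

```scala
// Same logic as the Spark job, but on local collections with toy data.
val lines = List("a\tb\t", "\tb\tc", "a\t\tc")

// limit -1 keeps trailing empty fields so every row has three columns
val columns = lines.map(_.split("\t", -1).toList)

// 1 for a populated column, 0 for an empty one
val populated = columns.map(row => row.map(c => if (c.nonEmpty) 1 else 0))

// element-wise sum across rows gives per-column counts
val counts = populated.reduce((r1, r2) => (r1, r2).zipped.map(_ + _))

println(counts) // List(2, 2, 2)
```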
Any thoughts on the fitness of this snippet? I'm a little annoyed at
creating an RDD full of 1's and 0's in the second step, and the if
expression feels awkward too. I was happy to find the zipped method for
the reduce step. Any feedback on how to improve this code is
appreciated; I'm a newbie to both Scala and Spark.
Thanks,
Philip