Hi Spark coders,

I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with:

    // RDD of List[String] corresponding to tab-delimited values
    val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)
    // RDD of List[Int] marking populated columns (1 for populated, 0 for not)
    val populatedColumns = columns.map(row => row.map(column => if (column.length > 0) 1 else 0))
    // List[Int] containing the sums of the 1's in each column
    val counts = populatedColumns.reduce((row1, row2) => (row1, row2).zipped.map(_ + _))

Any thoughts about the fitness of this code snippet? I'm a little annoyed by creating an RDD full of 1's and 0's in the second line. The if statement feels awkward too. I was happy to find the zipped method for the reduce step. Any feedback you might have on how to improve this code is appreciated. I'm a newbie to both Scala and Spark.
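In case it helps anyone reproduce this, here is a Spark-free sketch of the same counting logic on plain Scala collections, so the transformations can be checked in a REPL without a cluster. The sample rows are made up for illustration:

```scala
// Made-up sample data: three tab-split rows, empty string = unpopulated cell.
val rows: List[List[String]] = List(
  List("a", "",  "c"),
  List("",  "b", "c"),
  List("a", "b", "")
)

// Map each cell to 1 if populated, 0 otherwise (nonEmpty reads a bit
// cleaner than checking length > 0 explicitly).
val populated: List[List[Int]] = rows.map(_.map(cell => if (cell.nonEmpty) 1 else 0))

// Element-wise sum across rows gives the per-column populated counts.
val counts: List[Int] = populated.reduce((r1, r2) => (r1, r2).zipped.map(_ + _))
// counts == List(2, 2, 2)
```

The same two `map`/`reduce` steps apply unchanged once `rows` is an RDD instead of a List.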

Thanks,
Philip
