Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up
how many times each column is populated in an RDD. Here is the code I
came up with:
// RDD of List[String] corresponding to tab-delimited values.
// Note: split with limit -1 keeps trailing empty fields, so every
// row has the same number of columns (plain split drops them, which
// would make zipped silently truncate short rows).
val columns = spark.textFile("myfile.tsv").map(line =>
  line.split("\t", -1).toList)

// RDD of List[Int] marking populated columns (1 for populated,
// 0 for not populated)
val populatedColumns = columns.map(row =>
  row.map(column => if (column.length > 0) 1 else 0))

// List[Int] containing the sum of the 1's in each column
val counts = populatedColumns.reduce((row1, row2) =>
  (row1, row2).zipped.map(_ + _))
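For anyone who wants to poke at it, here is the same logic on plain Scala collections with some made-up sample rows (no cluster needed), which is how I convinced myself the zipped reduce does what I think it does:

```scala
// Same logic as the Spark job, but on local collections with toy data.
val lines = List("a\tb\t", "\tb\tc", "a\t\tc")

// limit -1 keeps trailing empty fields so every row has three columns
val columns = lines.map(_.split("\t", -1).toList)

// 1 for a populated column, 0 for an empty one
val populated = columns.map(row => row.map(c => if (c.nonEmpty) 1 else 0))

// element-wise sum across rows gives per-column counts
val counts = populated.reduce((r1, r2) => (r1, r2).zipped.map(_ + _))

println(counts) // List(2, 2, 2)
```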
Any thoughts on the fitness of this snippet? I'm a little annoyed at
creating an RDD full of 1's and 0's in the second step, and the if
expression feels awkward too. I was happy to find the zipped method for
the reduce step. Any feedback on how to improve this code is
appreciated; I'm a newbie to both Scala and Spark.
Thanks,
Philip