It would be a bit more straightforward to write it like this:

val columns = [same as before]
val counts = columns.flatMap(emit (col_id, 0 or 1) for each column).reduceByKey(_ + _)

Basically look at each row and emit several records using flatMap. Each record
has an ID for the column (maybe its index) and a flag for whether it's present.
Then you reduce by key to get the per-column count. Then you can collect at the
end.
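Filled in, it might look roughly like this (untested, so treat it as a sketch;
I'm assuming the spark-shell, where the SparkContext is already defined as sc,
the same object you called "spark"):

import org.apache.spark.SparkContext._  // brings in reduceByKey and friends on pair RDDs

// RDD[List[String]]: one List of cell values per row (same as your first line)
val columns = sc.textFile("myfile.tsv").map(line => line.split("\t").toList)

// For each row, emit one (columnIndex, flag) pair per cell:
// 1 if the cell is non-empty, 0 if it is empty.
val counts = columns
  .flatMap(row => row.zipWithIndex.map { case (cell, i) =>
    (i, if (cell.nonEmpty) 1 else 0)
  })
  .reduceByKey(_ + _)   // sum the flags per column index
  .collectAsMap()       // one entry per column, so it's small enough to collect

So a row like List("a", "", "c") contributes (0,1), (1,0), (2,1), and
reduceByKey adds those up across all the rows; counts ends up as a Map from
column index to how many rows had that column populated.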
- Patrick

On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <[email protected]> wrote:
> Hi Spark coders,
>
> I wrote my first little Spark job that takes columnar data and counts up
> how many times each column is populated in an RDD. Here is the code I came
> up with:
>
> //RDD of List[String] corresponding to tab delimited values
> val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)
> //RDD of List[Int] corresponding to populated columns (1 for populated and 0 for not populated)
> val populatedColumns = columns.map(row => row.map(column => if(column.length > 0) 1 else 0))
> //List[Int] contains sums of the 1's in each column
> val counts = populatedColumns.reduce((row1,row2) => (row1,row2).zipped.map(_+_))
>
> Any thoughts about the fitness of this code snippet? I'm a little annoyed
> by creating an RDD full of 1's and 0's in the second line. The if statement
> feels awkward too. I was happy to find the zipped method for the reduce
> step. Any feedback you might have on how to improve this code is
> appreciated. I'm a newbie to both Scala and Spark.
>
> Thanks,
> Philip
>
