It would be a bit more straightforward to write it like this:

val columns = [same as before]

val counts = columns.flatMap(emit (col_id, 0 or 1) for each column).reduceByKey(_ + _)

Basically, look at each row and emit one record per column using flatMap.
Each record has an ID for the column (e.g., its index) and a flag for
whether it's populated.

Then you reduce by key to get the per-column count. Then you can
collect at the end.
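To make that concrete, here's a rough sketch of the emit step in Scala
(the function name `emitPresence` is just something I made up for
illustration; this is the shape of the lambda you'd pass to flatMap, not
tested against a real cluster):

```scala
// Emit a (columnIndex, presenceFlag) pair for each column of one row.
// This is the per-row function you'd pass to flatMap.
def emitPresence(row: List[String]): List[(Int, Int)] =
  row.zipWithIndex.map { case (value, colId) =>
    (colId, if (value.nonEmpty) 1 else 0)
  }

// Then, on the RDD (assuming `columns` from before):
//   val counts = columns.flatMap(emitPresence).reduceByKey(_ + _)
// and collect the small per-column result at the driver with
// counts.collectAsMap() or counts.collect().
```

One thing to watch: `String.split("\t")` drops trailing empty strings, so
if a row ends with empty columns they won't show up at all; `split("\t", -1)`
keeps them if that matters for your counts.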

- Patrick

On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <[email protected]> wrote:
> Hi Spark coders,
>
> I wrote my first little Spark job that takes columnar data and counts up how
> many times each column is populated in an RDD.  Here is the code I came up
> with:
>
>     //RDD of List[String] corresponding to tab delimited values
>     val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)
>     //RDD of List[Int] corresponding to populated columns (1 for populated and 0 for not populated)
>     val populatedColumns = columns.map(row => row.map(column => if (column.length > 0) 1 else 0))
>     //List[Int] contains sums of the 1's in each column
>     val counts = populatedColumns.reduce((row1, row2) => (row1, row2).zipped.map(_ + _))
>
> Any thoughts about the fitness of this code snippet?  I'm a little annoyed
> by creating an RDD full of 1's and 0's in the second line.  The if statement
> feels awkward too.  I was happy to find the zipped method for the reduce
> step.  Any feedback you might have on how to improve this code is
> appreciated.  I'm a newbie to both Scala and Spark.
>
> Thanks,
> Philip
>
