Where does 'emit' come from? I don't see it in the Scala or Spark
apidocs (though I don't feel very deft at searching either!)
Thanks,
Philip
On 11/8/2013 2:23 PM, Patrick Wendell wrote:
It would be a bit more straightforward to write it like this:
val columns = [same as before]
val counts = columns.flatMap(emit (col_id, 0 or 1) for each column).reduceByKey(_ + _)
Basically look at each row and emit several records using flatMap.
Each record has an ID for the column (maybe its index) and a flag for
whether it's present.
Then you reduce by key to get the per-column count. Then you can
collect at the end.
- Patrick
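[Editor's note: "emit" here is pseudocode for returning multiple records from flatMap, not a Spark or Scala API. A minimal sketch of Patrick's approach, assuming `spark` is a SparkContext and the same "myfile.tsv" input as in Philip's original snippet:]

```scala
// Assumes `spark` is an org.apache.spark.SparkContext, as in the original code.
// RDD of List[String]: one list of tab-delimited values per row.
val columns = spark.textFile("myfile.tsv").map(_.split("\t").toList)

// For each row, "emit" one (columnIndex, flag) pair per column via flatMap;
// reduceByKey then sums the flags, giving a populated-count per column.
val counts = columns
  .flatMap(row => row.zipWithIndex.map { case (col, i) =>
    (i, if (col.nonEmpty) 1 else 0)
  })
  .reduceByKey(_ + _)

// Collect the (columnIndex, count) pairs back to the driver at the end.
val result = counts.collect()
```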
On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <[email protected]> wrote:
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up how
many times each column is populated in an RDD. Here is the code I came up
with:
// RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)

// RDD of List[Int] corresponding to populated columns (1 for populated, 0 for not populated)
val populatedColumns = columns.map(row => row.map(column => if (column.length > 0) 1 else 0))

// List[Int] containing the sums of the 1's in each column
val counts = populatedColumns.reduce((row1, row2) => (row1, row2).zipped.map(_ + _))
Any thoughts about the fitness of this code snippet? I'm a little annoyed
at creating an RDD full of 1's and 0's in the second step, and the if
expression feels awkward too. I was happy to find the zipped method for the
reduce step. Any feedback you might have on how to improve this code is
appreciated. I'm a newbie to both Scala and Spark.
Thanks,
Philip