Where does 'emit' come from? I don't see it in the Scala or Spark
apidocs (though I don't feel very deft at searching either!)
Thanks,
Philip
On 11/8/2013 2:23 PM, Patrick Wendell wrote:
It would be a bit more straightforward to write it like this:
val columns = [same as before]
val counts = columns.flatMap(emit (col_id, 0 or 1) for each column).reduceByKey(_ + _)
Basically look at each row and emit several records using flatMap.
Each record has an ID for the column (maybe its index) and a flag for
whether it's present.
Then you reduce by key to get the per-column count. Then you can
collect at the end.
- Patrick
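[Editor's note: "emit" here is pseudocode for returning multiple records from flatMap, not a Spark or Scala API. A minimal sketch of Patrick's approach, assuming `spark` is a SparkContext and the same "myfile.tsv" input as in Philip's original snippet:]

```scala
// Assumes `spark` is an org.apache.spark.SparkContext, as in the original code.
// RDD of List[String]: one list of tab-delimited values per row.
val columns = spark.textFile("myfile.tsv").map(_.split("\t").toList)

// For each row, "emit" one (columnIndex, flag) pair per column via flatMap;
// reduceByKey then sums the flags, giving a populated-count per column.
val counts = columns
  .flatMap(row => row.zipWithIndex.map { case (col, i) =>
    (i, if (col.nonEmpty) 1 else 0)
  })
  .reduceByKey(_ + _)

// Collect the (columnIndex, count) pairs back to the driver at the end.
val result = counts.collect()
```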
On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <[email protected]> wrote:
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up how
many times each column is populated in an RDD. Here is the code I came up
with:
// RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)

// RDD of List[Int] corresponding to populated columns (1 for populated, 0 for not populated)
val populatedColumns = columns.map(row => row.map(column => if (column.length > 0) 1 else 0))

// List[Int] containing the sums of the 1's in each column
val counts = populatedColumns.reduce((row1, row2) => (row1, row2).zipped.map(_ + _))
Any thoughts about the fitness of this code snippet? I'm a little annoyed
at creating an RDD full of 1's and 0's in the second step, and the if
expression feels awkward too. I was happy to find the zipped method for the
reduce step. Any feedback you might have on how to improve this code is
appreciated. I'm a newbie to both Scala and Spark.
Thanks,
Philip