Re: code review - counting populated columns

2013-11-10 Thread Patrick Wendell
Hey Philip, your code is exactly what I was suggesting. I didn't explain it clearly: when I said "emit XX", I just meant figure out how to do XX and return the result; there isn't actually a function called 'emit'. In your case, the correct way to do it was using zipWithIndex... I just couldn't
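
A minimal sketch of the zipWithIndex idea discussed in this message, assuming plain Scala collections as a stand-in for an RDD so that it runs without a cluster. The names `PopulatedColumns` and `countPopulated` are hypothetical and do not come from the thread. On a real pair RDD, the grouping step would presumably be a `reduceByKey(_ + _)` instead.

```scala
// Plain Scala collections stand in for a Spark RDD here (an assumption made
// so the sketch runs locally). All names are hypothetical.
object PopulatedColumns {
  // Count how many rows have a non-empty value at each column index.
  def countPopulated(lines: Seq[String]): Map[Int, Int] =
    lines
      .flatMap { line =>
        // A limit of -1 keeps trailing empty fields so indices stay aligned.
        line.split("\t", -1).zipWithIndex.collect {
          case (value, index) if value.nonEmpty => (index, 1)
        }
      }
      .groupBy(_._1) // local analogue of grouping by key
      .map { case (index, pairs) => index -> pairs.map(_._2).sum }
}
```

Pairing each value with its index means that rows of different lengths are handled naturally, since every populated cell contributes its own `(index, 1)` pair.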

code review - counting populated columns

2013-11-08 Thread Philip Ogren
Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with: // RDD of List[String] corresponding to tab-delimited values val columns = spark.textFile("myfile.tsv").map(line

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren
Where does 'emit' come from? I don't see it in the Scala or Spark apidocs (though I don't feel very deft at searching either!) Thanks, Philip On 11/8/2013 2:23 PM, Patrick Wendell wrote: It would be a bit more straightforward to write it like this: val columns = [same as before] val counts

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek
Your example requires each row to be exactly the same length, since zipped will truncate to the shorter of its two arguments. The second solution is elegant, but reduceByKey involves flying a bunch of data around to sort the keys. I suspect it would be a lot slower. But you could save yourself
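
The truncation concern raised here can be illustrated with a small sketch. It uses `zip` on plain sequences rather than the `zipped` call the message refers to (the original code is not shown in full in this archive), and the names `ZipTruncation`, `addZipped`, and `addPadded` are hypothetical.

```scala
// Hypothetical helper names; plain Scala collections only.
object ZipTruncation {
  // Element-wise sum using zip: the result is as long as the SHORTER input,
  // so counts for trailing columns are silently dropped.
  def addZipped(a: Seq[Int], b: Seq[Int]): Seq[Int] =
    a.zip(b).map { case (x, y) => x + y }

  // Element-wise sum using zipAll: the shorter input is padded with 0,
  // so rows of unequal length do not lose columns.
  def addPadded(a: Seq[Int], b: Seq[Int]): Seq[Int] =
    a.zipAll(b, 0, 0).map { case (x, y) => x + y }
}
```

With inputs of lengths 3 and 2, `addZipped` returns two elements while `addPadded` returns three, which is the difference that matters when rows are ragged.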

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek
Messed up. Should be val sparseRows = spark.textFile("myfile.tsv").map(line => line.split("\t").zipWithIndex.flatMap(tt => if (tt._1.length > 0) Some((tt._2, 1)) else None)) Then reduce with a mergeAdd. On Fri, Nov 8, 2013 at 3:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be

Re: code review - counting populated columns

2013-11-08 Thread Patrick Wendell
Hey Tom, reduceByKey will reduce locally on all the nodes, so there won't be any data movement except to combine totals at the end. - Patrick On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be exactly the same length, since zipped will
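
The behaviour described in this message (reduce within each node first, then combine the per-node totals) can be sketched conceptually as follows. This is an illustration on plain Scala collections, where each inner sequence plays the role of one partition; it is not Spark's actual implementation, and every name in it is hypothetical.

```scala
// Conceptual sketch only: each inner Seq stands for one partition's
// (columnIndex, 1) pairs. All names are hypothetical.
object LocalCombine {
  type Counts = Map[Int, Int]

  // Step 1: reduce within a single partition, with no data movement.
  def combineLocally(partition: Seq[(Int, Int)]): Counts =
    partition.foldLeft(Map.empty[Int, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  // Step 2: merge two sets of per-partition totals.
  def mergeTotals(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  // Only the small per-partition maps are combined at the end.
  def reduceByKeySketch(partitions: Seq[Seq[(Int, Int)]]): Counts =
    partitions.map(combineLocally).foldLeft(Map.empty[Int, Int])(mergeTotals)
}
```

In this model the data crossing partition boundaries is at most one entry per distinct key per partition, rather than one entry per populated cell.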

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren
Thank you for the pointers. I'm not sure I was able to fully understand either of your suggestions but here is what I came up with. I started with Tom's code but I think I ended up borrowing from Patrick's suggestion too. Any thoughts about my updated solution are more than welcome! I

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek
Patrick, you got me thinking, but I'm sticking to my opinion that reduceByKey should be avoided if possible. I tried some timings: def time[T](code: => T) = { val t0 = System.nanoTime: Double; val res = code; val t1 = System.nanoTime: Double; println("Elapsed time
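
Since the timing helper above is cut off by the archive, here is a separate, self-contained sketch of the same general technique: a by-name parameter evaluated once between two `System.nanoTime` readings. It is not a reconstruction of the original code; the names `Timing` and `timed`, and the choice to return the elapsed seconds rather than print them, are assumptions.

```scala
// Hypothetical helper, not the code from the message above.
object Timing {
  // Evaluate the by-name block exactly once and return its result
  // together with the elapsed wall-clock time in seconds.
  def timed[T](code: => T): (T, Double) = {
    val t0 = System.nanoTime()
    val result = code // the block runs here, not at the call site
    val t1 = System.nanoTime()
    (result, (t1 - t0) / 1e9)
  }
}
```

A call such as `Timing.timed { (1 to 1000).sum }` yields the sum together with the elapsed time. A single measurement like this is noisy on the JVM, so repeated runs would be needed before drawing conclusions about relative speed.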