subject:"code review \- counting populated columns"

Re: code review - counting populated columns

2013-11-10 Thread Patrick Wendell

Hey Philip, Your code is exactly what I was suggesting. I didn't explain it clearly: when I said "emit XX", I just meant figure out how to do XX and return the result. There isn't actually a function called 'emit'. In your case, the correct way to do it was using zipWithIndex... I just couldn't

code review - counting populated columns

2013-11-08 Thread Philip Ogren

Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with:
// RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren

Where does 'emit' come from? I don't see it in the Scala or Spark apidocs (though I don't feel very deft at searching either!)
Thanks, Philip
On 11/8/2013 2:23 PM, Patrick Wendell wrote: It would be a bit more straightforward to write it like this: val columns = [same as before] val counts

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek

Your example requires each row to be exactly the same length, since zipped will truncate to the shorter of its two arguments. The second solution is elegant, but reduceByKey involves flying a bunch of data around to sort the keys. I suspect it would be a lot slower. But you could save yourself
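The truncation point can be seen without Spark. The following is a minimal sketch on plain Scala collections, with made-up sample rows; it uses zip, which stops at the shorter input just as zipped does, and shows zipAll as one possible way to pad instead. The object and value names are illustrative only and do not come from the thread.

```scala
// Plain Scala collections, no Spark required; the sample rows are made up.
object ZipTruncation {
  // Per-row indicator vectors: 1 where a column is populated, 0 where it is empty.
  val shortRow: List[Int] = List(1, 0)
  val longRow: List[Int]  = List(1, 1, 1)

  // zip stops at the end of the shorter list, so the third column is dropped.
  val truncated: List[Int] = shortRow.zip(longRow).map { case (a, b) => a + b }

  // zipAll pads the shorter side (here with 0), so every column survives.
  val padded: List[Int] = shortRow.zipAll(longRow, 0, 0).map { case (a, b) => a + b }
}
```

Here truncated has two entries while padded has three, which is why rows of unequal length silently lose counts under the zip-based sum.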

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek

Messed up. Should be:
val sparseRows = spark.textFile("myfile.tsv").map(line => line.split("\t").zipWithIndex.flatMap( tt => if(tt._1.length>0) (tt._2, 1) )
Then reduce with a mergeAdd.
On Fri, Nov 8, 2013 at 3:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be
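The sparse-row idea can be tried on ordinary Scala collections first. This is only a sketch under stated assumptions: the thread names mergeAdd but does not show it, so the version below is a hypothetical helper that adds two index-to-count maps; collect is used in place of the flatMap/if from the post, and the sample lines are made up.

```scala
// Plain Scala collections, no Spark required.
object SparseCounts {
  // Hypothetical helper: the thread mentions "mergeAdd" but does not define it.
  // This version adds the counts of two maps key by key.
  def mergeAdd(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  // Turn one tab-delimited line into a sparse map: column index -> 1 for populated fields.
  // The limit of -1 keeps trailing empty fields so column positions stay aligned.
  def sparseRow(line: String): Map[Int, Int] =
    line.split("\t", -1).zipWithIndex.collect {
      case (field, index) if field.length > 0 => (index, 1)
    }.toMap

  // Made-up sample input: three rows of different lengths.
  val lines: List[String] = List("a\tb\tc", "a\t\tc", "a")

  // Reduce the per-row maps into per-column totals.
  val counts: Map[Int, Int] = lines.map(sparseRow).reduce(mergeAdd)
}
```

Because each row only carries the columns it actually populates, rows of different lengths merge without any truncation.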

Re: code review - counting populated columns

2013-11-08 Thread Patrick Wendell

Hey Tom, reduceByKey will reduce locally on all the nodes, so there won't be any data movement except to combine totals at the end.
- Patrick
On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be exactly the same length, since zipped will
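The "reduce locally, then combine totals" behaviour described here can be imitated on plain Scala collections. The sketch below is a simulation, not Spark code: the two inner lists stand in for partitions on two nodes, and sumByKey is a hypothetical helper written for this illustration.

```scala
// Plain Scala simulation of a two-stage reduce; this is not Spark's implementation.
object LocalCombine {
  // Two made-up "partitions" of (columnIndex, 1) pairs, standing in for data on two nodes.
  val partitions: List[List[(Int, Int)]] = List(
    List((0, 1), (1, 1), (0, 1)),
    List((0, 1), (2, 1))
  )

  // Hypothetical helper: sum the values for each key within one collection of pairs.
  def sumByKey(pairs: Seq[(Int, Int)]): Map[Int, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Step 1: combine inside each partition, so at most one pair per key leaves a node.
  val partials: List[Map[Int, Int]] = partitions.map(sumByKey)

  // Step 2: merge the small per-partition totals into the final counts.
  val totals: Map[Int, Int] = sumByKey(partials.flatMap(_.toSeq))
}
```

In this toy run only four partial totals cross the simulated partition boundary instead of the five original pairs; with many rows and few columns the reduction would be far larger.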

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren

Thank you for the pointers. I'm not sure I was able to fully understand either of your suggestions but here is what I came up with. I started with Tom's code but I think I ended up borrowing from Patrick's suggestion too. Any thoughts about my updated solution are more than welcome! I

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek

Patrick, you got me thinking, but I'm sticking to my opinion that reduceByKey should be avoided if possible. I tried some timings:
def time[T](code : => T) = { val t0 = System.nanoTime : Double val res = code val t1 = System.nanoTime : Double println("Elapsed time
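The timing helper is cut off in this listing, so the following is only a guess at a complete version. It keeps the by-name parameter and the nanoTime readings from the snippet; the message format, the millisecond unit, and returning the result are assumptions, as is the small usage example.

```scala
// A possible completion of the truncated helper; everything after "Elapsed time" is assumed.
object Timing {
  // Runs the by-name block once, prints the elapsed wall-clock time, and returns the result.
  def time[T](code: => T): T = {
    val t0 = System.nanoTime().toDouble
    val res = code // the block is evaluated here, exactly once
    val t1 = System.nanoTime().toDouble
    println("Elapsed time " + (t1 - t0) / 1e6 + " ms") // unit and wording are assumptions
    res
  }

  // Usage: wrap any expression; the value passes through unchanged.
  val total: Int = time { (1 to 1000).sum }
}
```

A single run like this measures one execution only, so JVM warm-up can dominate short jobs; repeating the measurement gives a fairer comparison.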

8 matches
