Hey Philip,
Your code is exactly what I was suggesting. I didn't explain it
clearly: when I said "emit XX", I just meant figure out how to do XX
and return the result; there isn't actually a function called 'emit'.
In your case, the correct way to do it was using zipWithIndex... I
just couldn't
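In other words, "emit" just means producing a (columnIndex, 1) pair for
every populated cell. A rough sketch of that step (the file name and
separator are only placeholders):

// Sketch only: pair each field with its column index, then keep a
// (columnIndex, 1) record for every non-empty field, so a row like
// "a\t\tb" yields Array((0,1), (2,1)).
val populated = spark.textFile("myfile.tsv").flatMap(line =>
  line.split("\t").zipWithIndex.collect {
    case (value, idx) if value.length > 0 => (idx, 1)
  })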
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up
how many times each column is populated in an RDD. Here is the code I
came up with:
// RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)
Where does 'emit' come from? I don't see it in the Scala or Spark
apidocs (though I don't feel very deft at searching either!)
Thanks,
Philip
On 11/8/2013 2:23 PM, Patrick Wendell wrote:
It would be a bit more straightforward to write it like this:
val columns = [same as before]
val counts
Your example requires each row to be exactly the same length, since zipped
will truncate to the shorter of its two arguments.
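For concreteness, a tiny plain-Scala example of that truncation (nothing
Spark-specific):

// zipped stops at the end of the shorter collection:
val a = List(1, 2, 3)
val b = List(10, 20)
(a, b).zipped.map(_ + _)   // List(11, 22); the 3 in a is silently dropped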
The second solution is elegant, but reduceByKey involves flying a bunch of
data around to sort the keys. I suspect it would be a lot slower. But you
could save yourself
Messed up. Should be
val sparseRows = spark.textFile("myfile.tsv").map(line =>
  line.split("\t").zipWithIndex.flatMap(tt =>
    if (tt._1.length > 0) Some((tt._2, 1)) else None))
Then reduce with a mergeAdd.
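('mergeAdd' isn't an actual Spark or Scala function; the idea is just to
merge the per-row sparse vectors by adding counts at matching column
indices. One way that final reduce could look, as a sketch:)

// Sketch of the "mergeAdd" reduce: turn each row's (index, 1) pairs into a
// Map and merge maps by summing counts for matching column indices.
val counts = sparseRows
  .map(_.toMap)
  .reduce { (m1, m2) =>
    (m1.keySet ++ m2.keySet)
      .map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))
      .toMap
  }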
On Fri, Nov 8, 2013 at 3:35 PM, Tom Vacek minnesota...@gmail.com wrote:
Your example requires each row to be
Hey Tom,
reduceByKey will reduce locally on all the nodes, so there won't be
any data movement except to combine totals at the end.
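Roughly, using the sparseRows RDD from earlier in the thread (just a
sketch):

// The _ + _ combiner runs inside each partition first (map-side combine),
// so the shuffle only carries one partial sum per column index per
// partition rather than every (columnIndex, 1) record.
val counts = sparseRows
  .flatMap(row => row)      // flatten to an RDD of (columnIndex, 1) pairs
  .reduceByKey(_ + _)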
- Patrick
On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek minnesota...@gmail.com wrote:
Your example requires each row to be exactly the same length, since zipped
will
Thank you for the pointers. I'm not sure I was able to fully understand
either of your suggestions, but here is what I came up with. I started
with Tom's code but I think I ended up borrowing from Patrick's
suggestion too. Any thoughts about my updated solution are more than
welcome! I
Patrick, you got me thinking, but I'm sticking to my opinion that
reduceByKey should be avoided if possible. I tried some timings:
def time[T](code: => T) = {
  val t0 = System.nanoTime: Double
  val res = code
  val t1 = System.nanoTime: Double
  println("Elapsed time: " + (t1 - t0) / 1e9 + " seconds")
  res
}
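(A hypothetical call, just to show how the harness wraps whatever is being
measured; counts stands in for whichever version is being compared:)

// Force evaluation inside the timing block so the work is actually measured.
time { counts.count() }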