Re: Spark Summit agenda posted

2013-11-08 Thread R. Revert
Would the PDF slides or video talks be posted for the people who can't attend the conference? Thanks. #__ Atte. Rafael R. 2013/11/7 Matei Zaharia matei.zaha...@gmail.com: Hi everyone, We're glad to announce the agenda of the Spark Summit, which will happen

Not caching rdds, spark.storage.memoryFraction setting

2013-11-08 Thread Grega Kešpret
Hi, The docs say: "Fraction of Java heap to use for Spark's memory cache. This should not be larger than the old generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase it if you configure your own old generation size." If we are not caching any RDDs, does

Random StreamCorruptedException during task execution

2013-11-08 Thread Guillaume Pitel
Hi dear Spark users and developers, I've stumbled on a problem that seems to occur randomly. My tasks sometimes (about 60% of the time) fail with errors like this: java.io.StreamCorruptedException: invalid handle value: 006E0007 at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1454) at

Re: Spark Summit agenda posted

2013-11-08 Thread Matei Zaharia
Yes, we do plan to make them available afterward. Matei On Nov 8, 2013, at 3:27 AM, R. Revert rafarevert...@gmail.com wrote: would the PDF slides or video talks be posted for the people who can't attend the conference? thanks #__ Atte. Rafael R.

Re: cluster hangs for no apparent reason

2013-11-08 Thread Walrus theCat
Got it. Thanks, that clarifies things. On Thu, Nov 7, 2013 at 3:34 PM, Shangyu Luo lsy...@gmail.com wrote: I am not sure, but in their RDD paper they have mentioned the usage of broadcast variables. Sometimes you may need a local variable in many map-reduce jobs and you do not want to copy them to

code review - counting populated columns

2013-11-08 Thread Philip Ogren
Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with: // RDD of List[String] corresponding to tab-delimited values val columns = spark.textFile("myfile.tsv").map(line
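
The code in the preview is cut off; below is a minimal sketch of the element-wise counting approach it describes. The file name, local mode, and "populated = non-empty cell" are assumptions, and the zipped addition expects every row to have the same number of columns:

    import org.apache.spark.SparkContext

    val spark = new SparkContext("local", "column-counts")

    // One Array[String] of tab-delimited cells per line.
    val columns = spark.textFile("myfile.tsv").map(line => line.split("\t"))

    // Turn each row into 0/1 indicators, then add the rows together element-wise.
    val counts = columns
      .map(row => row.map(cell => if (cell.nonEmpty) 1 else 0))
      .reduce((a, b) => (a, b).zipped.map(_ + _))

    counts.zipWithIndex.foreach { case (c, i) => println("column " + i + ": " + c) }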

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren
Where does 'emit' come from? I don't see it in the Scala or Spark API docs (though I don't feel very deft at searching, either!) Thanks, Philip On 11/8/2013 2:23 PM, Patrick Wendell wrote: It would be a bit more straightforward to write it like this: val columns = [same as before] val counts
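
Patrick's quoted code is also truncated; the sketch below is one reading of it, where "emit" is just pseudocode for producing a (columnIndex, 1) pair per populated cell. It reuses the `spark` context name from the thread and the same file/delimiter assumptions as above:

    import org.apache.spark.SparkContext._   // brings reduceByKey into scope on pair RDDs

    val counts = spark.textFile("myfile.tsv")
      .flatMap { line =>
        line.split("\t").zipWithIndex.collect {
          case (cell, idx) if cell.nonEmpty => (idx, 1)
        }
      }
      .reduceByKey(_ + _)   // per-column totals, combined locally before the shuffle
      .collect()
      .sortBy(_._1)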

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek
Your example requires each row to be exactly the same length, since zipped will truncate to the shorter of its two arguments. The second solution is elegant, but reduceByKey involves flying a bunch of data around to sort the keys. I suspect it would be a lot slower. But you could save yourself

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek
Messed up. Should be: val sparseRows = spark.textFile("myfile.tsv").map(line => line.split("\t").zipWithIndex.flatMap(tt => if (tt._1.length > 0) Some((tt._2, 1)) else None)) Then reduce with a mergeAdd. On Fri, Nov 8, 2013 at 3:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be
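
"mergeAdd" is not a built-in Spark or Scala function; the sketch below is one guess at the intended reduce step, merging per-row sparse maps of (columnIndex -> 1) by summing values per key. Names and the file path follow the thread's conventions:

    val sparseRows = spark.textFile("myfile.tsv").map { line =>
      line.split("\t").zipWithIndex.collect {
        case (cell, idx) if cell.nonEmpty => (idx, 1)
      }.toMap
    }

    val counts = sparseRows.reduce { (a, b) =>
      b.foldLeft(a) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0) + v)) }
    }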

Re: code review - counting populated columns

2013-11-08 Thread Patrick Wendell
Hey Tom, reduceByKey will reduce locally on all the nodes, so there won't be any data movement except to combine totals at the end. - Patrick On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be exactly the same length, since zipped will

Re: Not caching rdds, spark.storage.memoryFraction setting

2013-11-08 Thread Christopher Nguyen
Grega, the way to think about this setting is that it sets the maximum amount of memory Spark is allowed to use for caching RDDs before it must expire or spill them to disk. Spark in principle knows at all times how many RDDs are kept in memory and their total sizes, so it can for example persist
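
For reference, a hedged example of lowering the setting when a job caches nothing, so more of the executor heap is left for task execution. In the 0.8-era API this was done through a Java system property before the SparkContext was created; the 0.1 value is purely illustrative, not a recommendation from this thread:

    // Illustrative only: must be set before the SparkContext is constructed.
    System.setProperty("spark.storage.memoryFraction", "0.1")
    val sc = new org.apache.spark.SparkContext("local", "no-cache-job")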

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren
Thank you for the pointers. I'm not sure I was able to fully understand either of your suggestions, but here is what I came up with. I started with Tom's code, but I think I ended up borrowing from Patrick's suggestion too. Any thoughts about my updated solution are more than welcome! I

lzo read in spark

2013-11-08 Thread Rajeev Srivastava
Hi, Has someone successfully read an LZO file in Spark? Regards, Rajeev Srivastava Silverline Design Inc 2118 Walsh Ave, Suite 204, Santa Clara, CA 95050 cell: 408-409-0940
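
Not an answer from the thread, but one commonly used route is the hadoop-lzo input format through newAPIHadoopFile. The class names below assume the hadoop-lzo library (and its native LZO codec) is installed on every node; the path and app name are placeholders:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "lzo-read")
    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/input.lzo",
        classOf[com.hadoop.mapreduce.LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        sc.hadoopConfiguration)
      .map(_._2.toString)   // copy out of the reused Text object

    println(lines.count())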

Re: code review - counting populated columns

2013-11-08 Thread Tom Vacek
Patrick, you got me thinking, but I'm sticking to my opinion that reduceByKey should be avoided if possible. I tried some timings: def time[T](code: => T) = { val t0 = System.nanoTime: Double; val res = code; val t1 = System.nanoTime: Double; println("Elapsed time
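
The helper in the preview is cut off; a completed version of a timing helper along those lines is sketched below, with a trivial usage. In the thread it would wrap the two column-counting jobs being compared:

    def time[T](code: => T): T = {
      val t0 = System.nanoTime
      val res = code
      val t1 = System.nanoTime
      println("Elapsed time: " + (t1 - t0) / 1e9 + " s")
      res
    }

    val total = time { (1 to 1000000).sum }   // example call; replace with an RDD action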