Re: Surprising Spark SQL benchmark

2014-10-31 Thread Patrick Wendell
Hey Nick, Unfortunately Citus Data didn't contact any of the Spark or Spark SQL developers when running this. It is really easy to make one system look better than others when you are running a benchmark yourself because tuning and sizing can lead to a 10X performance improvement. This benchmark

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
Thanks for the response, Patrick. I guess the key takeaways are 1) the tuning/config details are everything (they're not laid out here), 2) the benchmark should be reproducible (it's not), and 3) reach out to the relevant devs before publishing (didn't happen). Probably key takeaways for any

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Steve Nunez
To be fair, we (the Spark community) haven't been any better. Take, for example, this benchmark, for which no details or code have been released to allow others to reproduce it: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html I would encourage anyone doing a Spark benchmark in

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
I believe that benchmark has a pending certification on it. See http://sortbenchmark.org under Process. It's true they did not share enough details on the blog for readers to reproduce the benchmark, but they will have to share enough with the committee behind the benchmark in order to be

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Kay Ousterhout
There's been an effort in the AMPLab at Berkeley to set up a shared codebase that makes it easy to run TPC-DS on Spark SQL, since it's something we do frequently in the lab to evaluate new research. Based on this thread, it sounds like making this more widely available is something that would be

Spark consulting

2014-10-31 Thread Alessandro Baretta
Hello, Is anyone open to do some consulting work on Spark in San Mateo? Thanks. Alex

Re: Spark consulting

2014-10-31 Thread Stephen Boesch
May we please refrain from using the Spark mailing list for job inquiries? Thanks. 2014-10-31 13:35 GMT-07:00 Alessandro Baretta alexbare...@gmail.com: Hello, Is anyone open to do some consulting work on Spark in San Mateo? Thanks. Alex

Parquet Migrations

2014-10-31 Thread Gary Malouf
Outside of what is discussed here https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is there any path for being able to modify a Parquet schema once some data has been written? This seems like the kind of thing that should make people pause when considering whether or not to

Re: Parquet Migrations

2014-10-31 Thread Michael Armbrust
You can't change a Parquet schema without re-encoding the data, because the footer index data has to be recalculated. However, you can manually do today what SPARK-3851 https://issues.apache.org/jira/browse/SPARK-3851 is going to do. Consider two schemas: Old Schema: (a: Int, b: String) New Schema,
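
The archived snippet ends before the new schema is given. As a minimal sketch of the general idea only, assuming a hypothetical new schema (a: Int, c: Long) that is not taken from the message, the manual approach can be pictured as projecting both datasets onto the union of their columns, padding absent columns with nulls, and then concatenating the results. The example below uses plain Python dictionaries in place of Parquet files and Spark SQL; all function names are illustrative.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]

# Old schema from the message: (a: Int, b: String).
OLD_COLUMNS = ["a", "b"]
# Hypothetical new schema, assumed for illustration only: (a: Int, c: Long).
NEW_COLUMNS = ["a", "c"]


def unified_columns(*schemas: Sequence[str]) -> List[str]:
    """Return the union of all column names, keeping first-seen order."""
    seen: List[str] = []
    for schema in schemas:
        for name in schema:
            if name not in seen:
                seen.append(name)
    return seen


def project(rows: List[Row], columns: Sequence[str]) -> List[Row]:
    """Project rows onto `columns`, filling absent columns with None (a SQL NULL)."""
    return [{name: row.get(name) for name in columns} for row in rows]


def union_all(old_rows: List[Row], new_rows: List[Row]) -> List[Row]:
    """Combine rows written under both schemas into one unified view."""
    columns = unified_columns(OLD_COLUMNS, NEW_COLUMNS)
    return project(old_rows, columns) + project(new_rows, columns)


# Stand-ins for data written before and after the schema change.
old_data = [{"a": 1, "b": "x"}]
new_data = [{"a": 2, "c": 10}]
unified = union_all(old_data, new_data)
```

In Spark SQL the analogous step would presumably be a UNION ALL over two SELECT statements that pad the missing columns with typed NULLs; the files written under the old schema stay untouched, and only the query layer reconciles the two layouts.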