Hey Nick,
Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
developers when running this. It is really easy to make one system
look better than others when you are running a benchmark yourself
because tuning and sizing can lead to a 10X performance improvement.
This benchmark
Thanks for the response, Patrick.
I guess the key takeaways are 1) the tuning/config details are everything
(they're not laid out here), 2) the benchmark should be reproducible (it's
not), and 3) the relevant devs should be consulted before publishing (that
didn't happen).
Probably key takeaways for any
To be fair, we (the Spark community) haven't been any better; for example, this
benchmark:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
for which no details or code have been released to allow others to
reproduce it. I would encourage anyone doing a Spark benchmark in
I believe that benchmark has a pending certification on it. See
http://sortbenchmark.org under Process.
It's true they did not share enough details on the blog for readers to
reproduce the benchmark, but they will have to share enough with the
committee behind the benchmark in order to be
There's been an effort in the AMPLab at Berkeley to set up a shared
codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
we do frequently in the lab to evaluate new research. Based on this
thread, it sounds like making this more widely-available is something that
would be
Hello,
Is anyone open to doing some consulting work on Spark in San Mateo?
Thanks.
Alex
May we please refrain from using the Spark mailing list for job inquiries.
Thanks.
2014-10-31 13:35 GMT-07:00 Alessandro Baretta alexbare...@gmail.com:
Hello,
Is anyone open to doing some consulting work on Spark in San Mateo?
Thanks.
Alex
Outside of what is discussed here
https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is
there any path to modifying a Parquet schema once some data has
been written? This seems like the kind of thing that should make people
pause when considering whether or not to
You can't change a Parquet schema without re-encoding the data, since the
footer index data needs to be recalculated. However, you can manually do
today what SPARK-3851
https://issues.apache.org/jira/browse/SPARK-3851 is going to do.
Consider two schemas:
Old Schema: (a: Int, b: String)
New Schema,
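The message is cut off here, but the manual workaround described above can be
sketched in plain Python (Spark and Parquet APIs omitted). This is only a
sketch of the merge logic, not Spark's implementation; the new schema used
below, which adds a hypothetical field (c: Int), is an assumption since the
original message does not show it. The idea: union the two schemas, then pad
rows from either side with nulls for fields they lack.

```python
# Sketch of the manual schema-merge workaround described above.
# NOTE: the new schema adding a field c is hypothetical; the original
# message's "New Schema" is not shown.

def merge_schemas(old_fields, new_fields):
    """Union two schemas, preserving order: old fields first, then any new ones."""
    merged = list(old_fields)
    for f in new_fields:
        if f not in merged:
            merged.append(f)
    return merged

def widen(rows, merged_fields):
    """Pad each row (a dict) with None for any merged field it lacks."""
    return [{f: row.get(f) for f in merged_fields} for row in rows]

old_schema = ["a", "b"]        # Old Schema: (a: Int, b: String)
new_schema = ["a", "b", "c"]   # hypothetical New Schema: adds (c: Int)
merged = merge_schemas(old_schema, new_schema)

old_rows = [{"a": 1, "b": "x"}]
new_rows = [{"a": 2, "b": "y", "c": 3}]

# Old data is widened to the merged schema (c becomes None), then unioned.
unioned = widen(old_rows, merged) + widen(new_rows, merged)
print(unioned)
```

In Spark terms, this corresponds to reading the old and new data separately,
adding the missing columns as nulls, and unioning the results into one
dataset with the merged schema.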