Re: Surprising Spark SQL benchmark
Hey Nick,

Unfortunately, Citus Data didn't contact any of the Spark or Spark SQL developers when running this. It is really easy to make one system look better than others when you run a benchmark yourself, because tuning and sizing can lead to a 10X performance improvement. This benchmark also doesn't share its methodology in a reproducible way. There are a bunch of things that aren't clear here:

1. Spark SQL has optimized Parquet features. Were these turned on?
2. The post doesn't mention computing statistics in Spark SQL, but it does so for Impala and Parquet. Statistics allow Spark SQL to broadcast small tables, which can make a 10X difference in TPC-H.
3. For data larger than memory, Spark SQL often performs better if you don't call cache. Did they try this?

Basically, a self-reported marketing benchmark like this, which *shocker* concludes that this vendor's solution is the best, is not particularly useful. If Citus Data wants to run a credible benchmark, I'd invite them to directly involve the Spark SQL developers in the future. Until then, I wouldn't give much credence to this or any other similar vendor benchmark.

- Patrick

On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

I know we don't want to be jumping at every benchmark someone posts out there, but this one surprised me: http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style This benchmark has Spark SQL failing to complete several queries in the TPC-H benchmark. I don't understand much about the details of performing benchmarks, but this was surprising. Are these results expected? Related HN discussion here: https://news.ycombinator.com/item?id=8539678

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
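For context, the knobs Patrick is alluding to can be set from the Spark SQL CLI or via SET statements. The sketch below is illustrative only: the property names and defaults are from the Spark 1.1/1.2-era documentation as best recalled (filter pushdown in particular was a 1.2-era setting, off by default), and the table name is just a TPC-H example.

```sql
-- Hedged sketch; verify property names against the docs for your version.

-- Parquet predicate pushdown (off by default in this era):
SET spark.sql.parquet.filterPushdown=true;

-- Tables whose statistics report a size below this threshold (bytes) are
-- broadcast to all workers instead of shuffled in joins:
SET spark.sql.autoBroadcastJoinThreshold=10485760;

-- With a HiveContext, compute size statistics so small dimension tables
-- (e.g. TPC-H's nation table) become broadcast-join candidates:
ANALYZE TABLE nation COMPUTE STATISTICS noscan;
```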
Re: Surprising Spark SQL benchmark
Thanks for the response, Patrick. I guess the key takeaways are: 1) the tuning/config details are everything (they're not laid out here), 2) the benchmark should be reproducible (it's not), and 3) reach out to the relevant devs before publishing (that didn't happen). Probably key takeaways for any kind of benchmark, really...

Nick
Re: Surprising Spark SQL benchmark
To be fair, we (the Spark community) haven't been any better. Take, for example, this benchmark: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html for which no details or code have been released to allow others to reproduce it. I would encourage anyone doing a Spark benchmark in the future to avoid the stigma of vendor-reported benchmarks and publish enough information and code to let others repeat the exercise easily.

- Steve
--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: Surprising Spark SQL benchmark
I believe that benchmark has a pending certification on it. See http://sortbenchmark.org under Process. It's true they did not share enough details on the blog for readers to reproduce the benchmark, but they will have to share enough with the committee behind the benchmark in order to be certified. Given that this is a benchmark not many people will be able to reproduce due to its size and complexity, I don't see it as a big negative that the details are not laid out, as long as there is independent certification from a third party.

From what I've seen so far, the best big data benchmark anywhere is this: https://amplab.cs.berkeley.edu/benchmark/ It has all the details you'd expect, including hosted datasets, to allow anyone to reproduce the full benchmark, covering a number of systems. I look forward to the next update to that benchmark (a lot has changed since February). And from what I can tell, it's produced by the same people behind Spark (Patrick being among them). So I disagree that the Spark community hasn't been any better in this regard.

Nick
Re: Surprising Spark SQL benchmark
There's been an effort in the AMPLab at Berkeley to set up a shared codebase that makes it easy to run TPC-DS on Spark SQL, since it's something we do frequently in the lab to evaluate new research. Based on this thread, it sounds like making this more widely available would be useful to folks for reproducing the results published by Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the list as soon as we're done.

-Kay
Spark consulting
Hello,

Is anyone open to doing some consulting work on Spark in San Mateo? Thanks.

Alex
Re: Spark consulting
May we please refrain from using the Spark mailing list for job inquiries. Thanks.
Parquet Migrations
Outside of what is discussed in https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is there any way to modify a Parquet schema once some data has been written? This seems like the kind of thing that should give people pause when considering whether or not to use Parquet+Spark...
Re: Parquet Migrations
You can't change a Parquet schema without re-encoding the data, because you need to recalculate the footer index data. However, you can manually do today what SPARK-3851 (https://issues.apache.org/jira/browse/SPARK-3851) is going to automate. Consider two schemas:

Old schema: (a: Int, b: String)
New schema, where I've dropped and added a column: (a: Int, c: Long)

parquetFile("old").registerTempTable("old")
parquetFile("new").registerTempTable("new")
sql("""
  SELECT a, b, CAST(null AS LONG) AS c FROM old
  UNION ALL
  SELECT a, CAST(null AS STRING) AS b, c FROM new
""").registerTempTable("unifiedData")

Because filters and column pruning are pushed down past UNIONs, this should execute as desired even if you write more complicated queries on top of unifiedData. It's a little onerous, but it should work for now. This approach can also support things like column renaming, which would be much harder to do automatically.
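The same null-padding UNION ALL pattern can be sanity-checked outside of Spark. Here is a small runnable sketch against SQLite (table names and sample rows are illustrative only; in Spark SQL the equivalent query runs over the two registered Parquet tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Old schema: (a, b); new schema: (a, c) -- column b dropped, c added.
cur.execute("CREATE TABLE old (a INTEGER, b TEXT)")
cur.execute("CREATE TABLE new (a INTEGER, c INTEGER)")
cur.execute("INSERT INTO old VALUES (1, 'x')")
cur.execute("INSERT INTO new VALUES (2, 99)")

# Unify both generations by padding each side's missing column with NULL.
rows = cur.execute("""
    SELECT a, b, NULL AS c FROM old
    UNION ALL
    SELECT a, NULL AS b, c FROM new
    ORDER BY a
""").fetchall()

print(rows)  # [(1, 'x', None), (2, None, 99)]
```

Queries over the unified view then see one consistent three-column schema, with NULLs standing in for whichever column a given generation of data lacks.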