Re: Surprising Spark SQL benchmark

2014-10-31 Thread Patrick Wendell
Hey Nick,

Unfortunately, Citus Data didn't contact any of the Spark or Spark SQL
developers before running this. It is really easy to make one system
look better than others when you run a benchmark yourself,
because tuning and sizing alone can produce a 10X performance difference.
And this benchmark doesn't share its methodology in a reproducible way.

There are a bunch of things that aren't clear here (see the sketch
after this list for the knobs involved):

1. Spark SQL has optimized Parquet features; were these turned on?
2. The post doesn't mention computing statistics in Spark SQL, but it does
this for Impala and Parquet. Statistics allow Spark SQL to broadcast
small tables, which can make a 10X difference on TPC-H.
3. For data larger than memory, Spark SQL often performs better if you
don't call cache. Did they try this?
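
To make those three points concrete, here is a minimal sketch against the
Spark 1.1-era Scala API. Treat it as illustrative only: the config keys,
the TPC-H table names, and the 10 MB threshold are assumptions about a
typical setup, not a record of what the benchmark actually ran, and exact
key names vary by Spark version.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("tpch-repro"))
val hc = new HiveContext(sc)

// (1) Parquet-specific behavior is controlled through SQLConf settings
// (key names are version-dependent; this one is an example).
hc.setConf("spark.sql.parquet.compression.codec", "snappy")

// (2) Compute statistics so the planner knows which tables are small
// enough to broadcast; the broadcast threshold itself is also tunable.
hc.sql("ANALYZE TABLE supplier COMPUTE STATISTICS noscan")
hc.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

// (3) For data larger than memory, query the table directly rather
// than calling cacheTable("lineitem") first.
val flags = hc.sql("SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag")

With statistics in place, joins against tables under the threshold are
planned as broadcast joins rather than shuffles, which is where the 10X
on TPC-H can come from.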

Basically, a self-reported marketing benchmark like this, which
(*shocker*) concludes that the vendor's own solution is the best, is not
particularly useful.

If Citus Data wants to run a credible benchmark, I'd invite them to
involve the Spark SQL developers directly in the future. Until then, I
wouldn't give much credence to this or any similar vendor benchmark.

- Patrick

On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 I know we don't want to be jumping at every benchmark someone posts out
 there, but this one surprised me:

 http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

 This benchmark has Spark SQL failing to complete several queries in the
 TPC-H benchmark. I don't understand much about the details of performing
 benchmarks, but this was surprising.

 Are these results expected?

 Related HN discussion here: https://news.ycombinator.com/item?id=8539678

 Nick




Re: Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
Thanks for the response, Patrick.

I guess the key takeaways are 1) the tuning/config details are everything
(they're not laid out here), 2) the benchmark should be reproducible (it's
not), and 3) reach out to the relevant devs before publishing (didn't
happen).

Probably key takeaways for any kind of benchmark, really...

Nick




Re: Surprising Spark SQL benchmark

2014-10-31 Thread Steve Nunez
To be fair, we (the Spark community) haven't been any better; for example,
this benchmark:

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

No details or code have been released that would allow others to
reproduce it. I would encourage anyone doing a Spark benchmark in the
future to avoid the stigma of vendor-reported benchmarks and to publish
enough information and code to let others repeat the exercise easily.

- Steve






Re: Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
I believe that benchmark has a certification pending. See
http://sortbenchmark.org under Process.

It's true they did not share enough details on the blog for readers to
reproduce the benchmark, but they will have to share enough with the
committee behind the benchmark in order to be certified. Given that this is
a benchmark not many people will be able to reproduce anyway, due to its
size and complexity, I don't see it as a big negative that the details are
not laid out, as long as there is independent certification from a third party.

From what I've seen so far, the best big data benchmark anywhere is this:
https://amplab.cs.berkeley.edu/benchmark/

It has all the details you'd expect, including hosted datasets, to allow
anyone to reproduce the full benchmark, and it covers a number of systems. I
look forward to the next update to that benchmark (a lot has changed since
February). And from what I can tell, it's produced by the same people behind
Spark (Patrick being among them).

So I disagree that the Spark community hasn't been any better in this
regard.

Nick





Re: Surprising Spark SQL benchmark

2014-10-31 Thread Kay Ousterhout
There's been an effort in the AMPLab at Berkeley to set up a shared
codebase that makes it easy to run TPC-DS on Spark SQL, since it's something
we do frequently in the lab to evaluate new research. Based on this
thread, it sounds like making it more widely available would be
useful to folks for reproducing the results published by
Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
list as soon as we're done.

-Kay


Spark consulting

2014-10-31 Thread Alessandro Baretta
Hello,

Is anyone open to doing some consulting work on Spark in San Mateo?

Thanks.

Alex


Re: Spark consulting

2014-10-31 Thread Stephen Boesch
May we please refrain from using the Spark mailing list for job inquiries.
Thanks.




Parquet Migrations

2014-10-31 Thread Gary Malouf
Outside of what is discussed in SPARK-3851
(https://issues.apache.org/jira/browse/SPARK-3851) as a future solution, is
there any path to modifying a Parquet schema once some data has
been written?  This seems like the kind of thing that should make people
pause when considering whether or not to use Parquet+Spark...


Re: Parquet Migrations

2014-10-31 Thread Michael Armbrust
You can't change a Parquet schema without re-encoding the data, since the
footer index data would need to be recalculated.  You can, however, manually
do today what SPARK-3851
(https://issues.apache.org/jira/browse/SPARK-3851) is going to automate.

Consider two schemas:

Old schema: (a: Int, b: String)
New schema, where I've dropped one column and added another: (a: Int, c: Long)

// Register each generation of Parquet files as its own temp table
// ("old" and "new" here are the paths to the two datasets).
parquetFile("old").registerTempTable("old")
parquetFile("new").registerTempTable("new")

// Pad each side with typed NULLs so the schemas line up, then expose
// the union as a single logical table.
sql("""
  SELECT a, b, CAST(null AS LONG) AS c FROM old UNION ALL
  SELECT a, CAST(null AS STRING) AS b, c FROM new
""").registerTempTable("unifiedData")

Because filter and column pushdown work past UNIONs, this should execute as
desired even if you write more complicated queries on top of
unifiedData.  It's a little onerous, but it should work for now.  This
approach can also support things like column renaming, which would be much
harder to do automatically; see the sketch below.
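
For instance, here is a sketch of the renaming case, assuming
(hypothetically, unlike the schemas above) that column b survived in the
new data but was renamed to b2; an alias in the unifying view maps it back
without rewriting any Parquet files:

// Hypothetical variation: the new files kept column b but renamed it
// to b2. Aliasing in the view restores the old name for queries.
sql("""
  SELECT a, b FROM old UNION ALL
  SELECT a, b2 AS b FROM new
""").registerTempTable("renamedData")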
