Hi Patrick,
We left the details of the Spark configuration we used out of the blog
post for brevity, but we're happy to share them. We've done quite a bit
of tuning to find the configuration settings that gave us the best query
times and ran the most queries. I think there might still be
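For readers curious what that kind of tuning involves, here is a minimal, hypothetical sketch. The property names are real Spark 1.x configuration keys, but the values are purely illustrative assumptions, not the settings actually used for the benchmark discussed here:

```python
# Illustrative only: real Spark 1.x configuration keys with HYPOTHETICAL
# values; these are NOT the settings used for the benchmark in this thread.
tuned_conf = {
    # Heap size per executor; sized to the machines being benchmarked.
    "spark.executor.memory": "8g",
    # Number of partitions for SQL shuffles; often raised for large joins.
    "spark.sql.shuffle.partitions": "200",
    # Fraction of the heap reserved for cached data (a Spark 1.x setting).
    "spark.storage.memoryFraction": "0.5",
}

# Settings like these are typically passed on the command line, e.g.:
#   spark-submit --conf spark.executor.memory=8g ...
for key, value in sorted(tuned_conf.items()):
    print(f"--conf {key}={value}")
```

Benchmarkers iterate over values like these per workload, which is exactly why publishing them matters for reproducibility.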
On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I believe that benchmark has a pending certification on it. See
http://sortbenchmark.org under Process.
Regarding this comment, Reynold has just announced that this benchmark is
now certified.
-
Steve Nunez, I believe the information behind the links below should
address your concerns earlier about Databricks's submission to the Daytona
Gray benchmark.
by a 2001 Toyota Celica.
- Steve
From: Nicholas Chammas nicholas.cham...@gmail.com
Date: Wednesday, November 5, 2014 at 15:56
To: Steve Nunez snu...@hortonworks.com
Cc: Patrick Wendell pwend...@gmail.com, dev dev@spark.apache.org
Subject: Re: Surprising Spark SQL benchmark
applied and missed, we'd
love to be involved with the community in re-running the numbers.
Is this email thread the best place to continue the conversation?
Best,
Ozgun
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Surprising-Spark-SQL-benchmark-tp9041p9073.html
Hi Kay,
Thank you so much for your update!
Looking forward to the shared code from AMPLab. As a member of the Spark
community, I really hope I can help run TPC-DS on Spark SQL. At the
moment, I am trying the 22 TPC-H queries on Spark SQL 1.1.0 with Hive 0.12
and with Hive 0.13.1, respectively
Two thoughts here:
1. The real flaw with the sort benchmark was that Hadoop wasn't run on the
same hardware. Given the advances in networking (availability of
10 Gb Ethernet) and disks (SSDs) since the Hadoop benchmarks it was compared
to, it's an apples-to-oranges comparison. Without that, it
Good points raised. Some comments.
Re: #1
It seems like there is a misunderstanding of the purpose of the Daytona
Gray benchmark. The purpose of the benchmark is to see how fast you can
sort 100 TB of data (technically, your sort rate during the operation)
using *any* hardware or software
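To make the "sort rate" framing concrete, here is a quick back-of-the-envelope calculation. The 100 TB / 23-minute figure is the publicly reported Databricks result from their blog post, not a number quoted in this thread:

```python
# Back-of-the-envelope sort rate for the Daytona Gray benchmark.
# 100 TB in ~23 minutes is the publicly reported Databricks figure
# (an external number, not one stated in this thread).
data_tb = 100.0      # dataset size in terabytes
elapsed_min = 23.0   # reported elapsed time in minutes

rate_tb_per_min = data_tb / elapsed_min
print(f"{rate_tb_per_min:.2f} TB/min")  # prints 4.35 TB/min
```

The benchmark ranks submissions by this rate, which is why hardware choice is deliberately left open: any combination of hardware and software is fair game.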
Kay,
Is this effort related to the existing AMPLab Big Data benchmark that
covers Spark, Redshift, Tez, and Impala?
Nick
On Friday, October 31, 2014, Kay Ousterhout k...@eecs.berkeley.edu wrote:
There's been an effort in the AMPLab at Berkeley to set up a shared
codebase that makes it easy to run
Hi Nick,
No -- we're doing a much more constrained thing of just trying to get
things set up to easily run TPC-DS on SparkSQL (which involves generating
the data, storing it in HDFS, getting all the queries in the right format,
etc.).
Cloudera does have a repo here:
Hey Nick,
Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
developers when running this. It is really easy to make one system
look better than others when you are running a benchmark yourself
because tuning and sizing can lead to a 10X performance improvement.
This benchmark
Thanks for the response, Patrick.
I guess the key takeaways are 1) the tuning/config details are everything
(they're not laid out here), 2) the benchmark should be reproducible (it's
not), and 3) the relevant devs should be consulted before publishing (that
didn't happen).
Probably key takeaways for any
To be fair, we (the Spark community) haven't been any better. For example,
this benchmark:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
No details or code have been released to allow others to reproduce it. I
would encourage anyone doing a Spark benchmark in
I believe that benchmark has a pending certification on it. See
http://sortbenchmark.org under Process.
It's true they did not share enough details on the blog for readers to
reproduce the benchmark, but they will have to share enough with the
committee behind the benchmark in order to be
There's been an effort in the AMPLab at Berkeley to set up a shared
codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
we do frequently in the lab to evaluate new research. Based on this
thread, it sounds like making this more widely-available is something that
would be