Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-05 Thread Edmon Begoli
I think that "plus" is a good starting point for a general benchmark, and
then "ubenchmark" maybe for fine-grained profiling of the sub-components
such as planner, etc..
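
To make that concrete, here is a rough sketch of what a planner micro-benchmark
in the ubenchmark style might look like. It assumes JMH (which the ubenchmark
module already uses) and drives the planner through Calcite's Frameworks/Planner
API; the schema-free VALUES query is only a placeholder for the complex queries
(TPC-H/TPC-DS) we would actually want to plan:

import java.util.concurrent.TimeUnit;

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;
import org.apache.calcite.tools.Programs;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class PlannerBenchmark {
  // Placeholder query that needs no schema; a real benchmark would plan
  // complex queries against a populated schema instead.
  private static final String SQL =
      "select * from (values (1, 'a'), (2, 'b')) as t(x, y) where x > 1";

  @Benchmark
  public RelNode parseValidateAndOptimize() throws Exception {
    // A Planner is single-use per statement, so each iteration builds a
    // fresh one; that setup cost is included in the measurement.
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .programs(Programs.standard())
        .build();
    Planner planner = Frameworks.getPlanner(config);
    SqlNode parsed = planner.parse(SQL);
    SqlNode validated = planner.validate(parsed);
    RelRoot root = planner.rel(validated);
    RelTraitSet traits = root.rel.getTraitSet()
        .replace(EnumerableConvention.INSTANCE);
    // Returning the RelNode keeps JMH from dead-code-eliminating the work.
    return planner.transform(0, traits, root.rel);
  }
}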

On Mon, Feb 5, 2018 at 8:08 PM, Julian Hyde  wrote:

> Note that Calcite has a “plus” module which is a place to add other data
> sets (e.g. TPC-H, TPC-DS) and tests and benchmarks based on them. Also the
> “ubenchmark” module for micro-benchmarks. I don’t know whether the work you
> are planning would be a natural fit within these modules.


Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-05 Thread Julian Hyde
Note that Calcite has a “plus” module which is a place to add other data sets 
(e.g. TPC-H, TPC-DS) and tests and benchmarks based on them. Also the 
“ubenchmark” module for micro-benchmarks. I don’t know whether the work you are 
planning would be a natural fit within these modules.



Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-05 Thread Edmon Begoli
I am going to create two JIRA issues:

1. Development of the benchmark for Calcite.

2. An R&D effort focused on benchmarking, performance evaluation, and
a study.

Thank you,
Edmon


Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-05 Thread Ruhollah Farchtchi
I would think that a TPC-DS benchmark would be more appropriate for the
types of queries I'd be interested in running with Calcite. Also, as an end
result of these efforts, I would imagine the community would get better
instrumentation of metrics up and down the query processing pipeline, from
parsing to optimization, rewrites, and so on. This would be interesting even
as a feature to use in conjunction with the lattice framework, to decide
which queries to eventually build lattices for, based on estimated time
savings. A rough sketch of such instrumentation follows below.

Ruhollah Farchtchi
ruhollah.farcht...@gmail.com
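
One way to approximate that instrumentation, as a sketch: Calcite exposes a
Hook mechanism (org.apache.calcite.runtime.Hook) that fires callbacks as a
statement moves through the pipeline. The hook names used here (PARSE_TREE,
CONVERTED, PLAN_BEFORE_IMPLEMENTATION) exist in the Hook enum, but the
callback signature has changed across Calcite versions, so treat this as an
illustration rather than a recipe:

import org.apache.calcite.runtime.Hook;

public class PhaseTimer {
  public static void main(String[] args) throws Exception {
    final long start = System.nanoTime();
    // Each hook fires when a statement reaches that phase; we print the
    // elapsed time since the run began, giving rough per-phase timings.
    try (Hook.Closeable parsed =
             Hook.PARSE_TREE.addThread(o -> mark(start, "parsed"));
         Hook.Closeable converted =
             Hook.CONVERTED.addThread(o -> mark(start, "converted to rel"));
         Hook.Closeable planned =
             Hook.PLAN_BEFORE_IMPLEMENTATION.addThread(o -> mark(start, "optimized"))) {
      // ... run a query through a Calcite JDBC connection here ...
    }
  }

  static void mark(long start, String phase) {
    System.out.printf("%-18s %10.2f ms%n",
        phase, (System.nanoTime() - start) / 1e6);
  }
}

Numbers like these could feed the lattice idea directly: queries whose
planning-plus-execution time is consistently high become candidates for a
lattice, with the measured time as the estimated saving.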


Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-05 Thread Michael Mior
One interesting exercise would also be to pick a popular benchmark (e.g.
TPC-H) and compare the plans produced by Calcite against those of existing
RDBMS optimizers (e.g. Postgres, MySQL). Along with a performance analysis
of the various options, it seems there's a paper in there. One way to
capture both plans is sketched below.

--
Michael Mior
mm...@apache.org
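
A minimal sketch of that exercise: Calcite's JDBC driver understands
EXPLAIN PLAN FOR, and Postgres has EXPLAIN, so both plans can be captured
over plain JDBC. The query, the model file (model.json), and the connection
details below are placeholders, not a prescribed setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PlanCompare {
  // A TPC-H-flavored query; table and column names (and their casing,
  // which depends on the Calcite model/lexical config) are assumptions.
  static final String SQL =
      "select o_orderpriority, count(*) from orders "
          + "where o_orderdate >= date '1995-01-01' "
          + "group by o_orderpriority";

  static void dumpPlan(String url, String explainPrefix) throws Exception {
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(explainPrefix + SQL)) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // each row is one plan line
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // model.json is a hypothetical model mapping a Calcite schema onto
    // the TPC-H tables.
    dumpPlan("jdbc:calcite:model=model.json", "explain plan for ");
    dumpPlan("jdbc:postgresql://localhost/tpch?user=tpch&password=tpch",
        "explain ");
  }
}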


Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-04 Thread Riccardo Tommasini
Regarding the comparison with other federated query processing engines such as
Spark SQL and PrestoDB:

What about KSQL [1] and InfluxDB's recent work (I'm still looking for a link)?


[1] https://github.com/confluentinc/ksql

RT

On 4 Feb 2018, 16:35 +0100, Edmon Begoli , wrote:

> * Comparison with other federated query processing engines such as Spark
> SQL and PrestoDB


Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-04 Thread Luis Fernando Kauer
I'd like to share the following link that I came across some time ago, about a
product that uses Calcite to optimize Spark queries:
https://www.datascience.com/blog/grunion-data-science-tools-query-optimizer-apache-spark





Benchmarking Calcite - starting the conversation on the targets and design of the benchmark

2018-02-03 Thread Edmon Begoli
I am planning on opening an issue and coordinating an initiative to
develop a Calcite-focused benchmark.

This would lead to the development of an executable, reportable benchmark,
and to the next publication aimed at another significant computer science
conference or journal.

Before I submit a JIRA issue, I would like to get your feedback on what
this benchmark might be, both in terms of what it should benchmark and how
it should be implemented.

A couple of preliminary thoughts that came out of the conversation with the
co-authors of our SIGMOD paper:

* Optimizer runtime for complex queries (we could also compare with the
runtime of executing the optimized query directly)
* Calcite-optimized query
* Unoptimized query with the optimizer of the backend disabled
* Unoptimized query with the optimizer of the backend enabled
* Overhead of going through Calcite adapters vs. natively accessing the
target DB (see the timing sketch after this list)
* Comparison with other federated query processing engines such as Spark
SQL and PrestoDB
* Use TPC-H or TPC-DS for this purpose
* Use the Star Schema Benchmark (SSB)
* Planning and execution time for queries that span multiple systems,
e.g. Postgres and Cassandra, Postgres and Pig, or Pig and Cassandra
(see the multi-system sketch at the end of this message).
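
For the adapter-overhead item, a minimal timing sketch over plain JDBC.
The query, model.json, and connection strings are placeholders; model.json
would map a Calcite schema onto the same Postgres database that the native
URL reaches directly:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AdapterOverhead {
  // Hypothetical query; identifier casing depends on the model config.
  static final String SQL = "select count(*) from lineitem";

  static long timeMillis(String url) throws SQLException {
    long start = System.nanoTime();
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(SQL)) {
      while (rs.next()) {
        rs.getLong(1); // drain the result set so the work actually happens
      }
    }
    // Includes connection setup, which is part of the adapter overhead.
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("via Calcite adapter: "
        + timeMillis("jdbc:calcite:model=model.json") + " ms");
    System.out.println("native Postgres:     "
        + timeMillis("jdbc:postgresql://localhost/tpch?user=tpch&password=tpch")
        + " ms");
  }
}

Single-shot wall-clock timing like this is naive; a real benchmark would
repeat each measurement and discard warm-up iterations, which a harness such
as JMH automates.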



Follow approaches similar to:
* https://www.slideshare.net/julianhyde/w-435phyde-3
* https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hive-performance-tuning/content/ch_cost-based-optimizer.html
* https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/
(How much of this is still relevant, given it targets Hive 0.14? Can we
reuse its queries/benchmarks?)


Please share your suggestions.
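
For the multi-system bullet in the list above, a sketch of how a cross-system
query could be posed through Calcite's inline model support. The factory class
names are those of the JDBC and Cassandra adapters as I understand them, but
the schema layout, hosts, keyspace, table names, and identifier casing are all
assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CrossSystemQuery {
  public static void main(String[] args) throws Exception {
    // Inline Calcite model with two schemas: one backed by Postgres via
    // the JDBC adapter, one backed by Cassandra via its adapter. All
    // operand values (URLs, credentials, keyspace) are placeholders.
    String model = "inline:{"
        + " version: '1.0', defaultSchema: 'PG',"
        + " schemas: ["
        + "   { name: 'PG', type: 'custom',"
        + "     factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',"
        + "     operand: { jdbcUrl: 'jdbc:postgresql://localhost/tpch',"
        + "                jdbcDriver: 'org.postgresql.Driver',"
        + "                jdbcUser: 'tpch', jdbcPassword: 'tpch' } },"
        + "   { name: 'CASS', type: 'custom',"
        + "     factory: 'org.apache.calcite.adapter.cassandra.CassandraSchemaFactory',"
        + "     operand: { host: 'localhost', keyspace: 'tpch' } }"
        + " ]}";
    try (Connection conn =
             DriverManager.getConnection("jdbc:calcite:model=" + model);
         Statement stmt = conn.createStatement();
         // One statement joins a Postgres table to a Cassandra table;
         // Calcite plans the federated query and pushes work to each
         // backend, which is exactly what the benchmark would time.
         ResultSet rs = stmt.executeQuery(
             "select c.\"c_name\", o.\"o_totalprice\" "
                 + "from \"PG\".\"orders\" as o "
                 + "join \"CASS\".\"customer\" as c "
                 + "on o.\"o_custkey\" = c.\"c_custkey\"")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " " + rs.getDouble(2));
      }
    }
  }
}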