Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
I think that "plus" is a good starting point for a general benchmark, and "ubenchmark" for fine-grained profiling of sub-components such as the planner.

On Mon, Feb 5, 2018 at 8:08 PM, Julian Hyde wrote:
> Note that Calcite has a "plus" module which is a place to add other data
> sets (e.g. TPC-H, TPC-DS) and tests and benchmarks based on them. Also the
> "ubenchmark" module for micro-benchmarks. I don't know whether the work you
> are planning would be a natural fit within these modules.
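Calcite's "ubenchmark" module is built on JMH, which handles warmup, forking and dead-code elimination properly. As a rough illustration of the warmup-then-measure pattern such a planner micro-benchmark relies on, here is a minimal hand-rolled harness; the class name and the planning stand-in are hypothetical, not part of Calcite:

```java
import java.util.Arrays;
import java.util.function.Supplier;

/** Minimal micro-benchmark harness (a sketch only; the real
 *  "ubenchmark" module uses JMH, which does this far more rigorously). */
public final class PlannerBench {

  /** Runs {@code task} for {@code warmup} unmeasured iterations, then
   *  {@code iters} measured ones, and returns the median time in nanos. */
  public static long medianNanos(Supplier<?> task, int warmup, int iters) {
    for (int i = 0; i < warmup; i++) {
      task.get();                       // let the JIT compile the hot path
    }
    long[] samples = new long[iters];
    for (int i = 0; i < iters; i++) {
      long start = System.nanoTime();
      task.get();
      samples[i] = System.nanoTime() - start;
    }
    Arrays.sort(samples);               // median is robust to GC outliers
    return samples[iters / 2];
  }

  public static void main(String[] args) {
    // Stand-in for a real planning task, e.g. optimizing a TPC-H query.
    long nanos = medianNanos(() -> Integer.toBinaryString(42), 1000, 101);
    System.out.println("median time: " + nanos + " ns");
  }
}
```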
Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
Note that Calcite has a "plus" module which is a place to add other data sets (e.g. TPC-H, TPC-DS) and tests and benchmarks based on them. Also the "ubenchmark" module for micro-benchmarks. I don't know whether the work you are planning would be a natural fit within these modules.

> On Feb 5, 2018, at 4:38 PM, Edmon Begoli wrote:
>
> I am going to create two JIRA issues:
>
> 1. Development of the benchmark for Calcite.
>
> 2. An R&D effort focused on benchmarking, performance evaluation, and a study.
Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
I am going to create two JIRA issues:

1. Development of the benchmark for Calcite.

2. An R&D effort focused on benchmarking, performance evaluation, and a study.

Thank you,
Edmon

On Mon, Feb 5, 2018 at 9:26 AM, Michael Mior wrote:
> One interesting exercise would also be to pick a popular benchmark (e.g.
> TPC-H) and just look at the plan produced by Calcite vs. existing RDBMS
> optimizers (e.g. Postgres, MySQL). Along with performance analysis of the
> various options, it seems there's a paper in there.
Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
I would think that a TPC-DS benchmark would be more appropriate for the type of queries I'd be interested in running through Calcite. Also, as an end result of these efforts, I would imagine the community would get better instrumentation of metrics up and down the query processing pipeline, from parsing to optimization, rewrites, etc. This would be interesting even as a feature to use in conjunction with the lattice framework, to decide which queries to eventually build lattices for, based on an estimate of the time savings.

Ruhollah Farchtchi
ruhollah.farcht...@gmail.com

On Mon, Feb 5, 2018 at 9:26 AM, Michael Mior wrote:
> One interesting exercise would also be to pick a popular benchmark (e.g.
> TPC-H) and just look at the plan produced by Calcite vs. existing RDBMS
> optimizers (e.g. Postgres, MySQL).
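The per-stage instrumentation suggested above could be as simple as a timer wrapped around each phase of the pipeline. A minimal sketch, with hypothetical class and stage names (Calcite does not ship this helper):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-stage timer: wraps each phase of the pipeline
 *  (parse, validate, optimize, execute) and records how long it took. */
public final class StageTimer {
  private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

  /** Times one pipeline stage and records the result under {@code name}. */
  public <T> T time(String name, java.util.function.Supplier<T> stage) {
    long start = System.nanoTime();
    T result = stage.get();
    nanosByStage.merge(name, System.nanoTime() - start, Long::sum);
    return result;
  }

  /** Stage name -> accumulated nanoseconds, in insertion order. */
  public Map<String, Long> report() {
    return new LinkedHashMap<>(nanosByStage);
  }

  public static void main(String[] args) {
    StageTimer timer = new StageTimer();
    // Stand-ins for real pipeline stages:
    String ast = timer.time("parse", () -> "SELECT * FROM t");
    String plan = timer.time("optimize", () -> ast + " [optimized]");
    timer.report().forEach((stage, ns) ->
        System.out.println(stage + ": " + ns + " ns"));
  }
}
```

A report like this, keyed by stage, is also what a lattice-recommendation heuristic would need: estimated time saved per query is just the optimize-plus-execute cost avoided when a lattice answers it.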
Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
One interesting exercise would also be to pick a popular benchmark (e.g. TPC-H) and just look at the plan produced by Calcite vs. existing RDBMS optimizers (e.g. Postgres, MySQL). Along with performance analysis of the various options, it seems there's a paper in there.

--
Michael Mior
mm...@apache.org

2018-02-03 23:21 GMT-05:00 Edmon Begoli:
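One practical wrinkle in comparing plans across engines is that EXPLAIN output carries engine-specific cost and cardinality annotations. A hedged sketch of normalizing that text so plan shapes can be diffed structurally; the class name and regex patterns are illustrative, not exhaustive:

```java
import java.util.regex.Pattern;

/** Hypothetical normalizer for EXPLAIN output: strips engine-specific
 *  cost/cardinality annotations so plan shapes can be compared across
 *  optimizers. The patterns below are illustrative, not exhaustive. */
public final class PlanNormalizer {
  // e.g. Postgres: "(cost=0.00..35.50 rows=10 width=4)"
  private static final Pattern COSTS =
      Pattern.compile("\\(cost=[^)]*\\)");
  // Any remaining numeric literal (row counts, cumulative costs, ...).
  private static final Pattern NUMBERS =
      Pattern.compile("\\d+(\\.\\d+)?");

  public static String normalize(String plan) {
    String noCosts = COSTS.matcher(plan).replaceAll("(cost=?)");
    return NUMBERS.matcher(noCosts).replaceAll("?").trim();
  }

  public static void main(String[] args) {
    String pg = "Seq Scan on t (cost=0.00..35.50 rows=10 width=4)";
    System.out.println(PlanNormalizer.normalize(pg));
  }
}
```

Two normalized plans can then be compared with an ordinary text diff, leaving only genuine differences in join order, access paths, and operator choice.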
Re: Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
Regarding the comparison with other federated query processing engines such as Spark SQL and PrestoDB: what about KSQL [1], and InfluxDB's recent work (looking for a link)?

[1] https://github.com/confluentinc/ksql

RT

On 4 Feb 2018, 16:35 +0100, Edmon Begoli wrote:
> * Comparison with other federated query processing engines such as Spark
> SQL and PrestoDB
Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
I'd like to share the following link that I came across some time ago, about a product that uses Calcite to optimize Spark queries.

https://www.datascience.com/blog/grunion-data-science-tools-query-optimizer-apache-spark
Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
I am planning on opening an issue, and coordinating an initiative to develop a Calcite-focused benchmark.

This would lead to the development of an executable, reportable benchmark, and of the next publication aimed at another significant computer science conference or journal.

Before I submit a JIRA issue, I would like to get your feedback on what this benchmark might be, both in terms of what it should benchmark and how it should be implemented.

A couple of preliminary thoughts that came out of the conversation with the co-authors of our SIGMOD paper are:

* Optimizer runtime for complex queries (we could also compare with the runtime of executing the optimized query directly):
  * Calcite-optimized query
  * Unoptimized query with the optimizer of the backend disabled
  * Unoptimized query with the optimizer of the backend enabled
* Overhead of going through Calcite adapters vs. natively accessing the target DB
* Comparison with other federated query processing engines such as Spark SQL and PrestoDB:
  * use TPC-H or TPC-DS for this purpose
  * use the Star Schema Benchmark (SSB)
* Planning and execution time for queries that span multiple systems (e.g. Postgres and Cassandra, Postgres and Pig, Pig and Cassandra).

Follow approaches similar to:

* https://www.slideshare.net/julianhyde/w-435phyde-3
* https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hive-performance-tuning/content/ch_cost-based-optimizer.html
* https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/ (How much of this is still relevant (Hive 0.14)? Can we use its queries/benchmarks?)

Please share your suggestions.
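The adapter-overhead and optimizer-benefit measurements proposed above reduce to simple ratios over measured runtimes. A sketch of the metric definitions, with hypothetical names and example numbers chosen purely for illustration:

```java
/** Hypothetical metric helpers for the measurements proposed above. */
public final class BenchMetrics {
  /** Relative overhead of going through a Calcite adapter vs. hitting
   *  the backend natively, as a percentage of the native runtime. */
  public static double adapterOverheadPct(double adapterMs, double nativeMs) {
    return 100.0 * (adapterMs - nativeMs) / nativeMs;
  }

  /** Speedup of the Calcite-optimized query over the unoptimized one
   *  (values greater than 1.0 mean the optimizer helped). */
  public static double optimizerSpeedup(double unoptimizedMs, double optimizedMs) {
    return unoptimizedMs / optimizedMs;
  }

  public static void main(String[] args) {
    // Illustrative numbers only: 120 ms through the adapter vs. 100 ms natively
    System.out.println(adapterOverheadPct(120.0, 100.0) + "% overhead");
    // 450 ms unoptimized vs. 150 ms Calcite-optimized
    System.out.println(optimizerSpeedup(450.0, 150.0) + "x speedup");
  }
}
```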