[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718192#comment-16718192 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202

   Merged build finished. Test FAILed.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Benchmark performance of In and InSet expressions
> -------------------------------------------------
>
>                 Key: SPARK-26203
>                 URL: https://issues.apache.org/jira/browse/SPARK-26203
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons, to avoid O(n) time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then (e.g., generation of Java code to evaluate expressions), so it is worth measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type with {{In}} and {{InSet}} and outline the existing bottlenecks. Once we have this information, we can come up with solutions.
> Based on my preliminary investigation, a number of optimizations are possible, many of which depend on the specific data type.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
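The autoboxing cost described in the issue can be sketched outside Spark. The following is a standalone illustration, not Spark's actual generated code: an In-style predicate compiles down to a chain of unboxed comparisons, while an InSet-style predicate probes a hash set of boxed values, so every primitive input must be boxed before the lookup. All names below are hypothetical.

```scala
// Standalone sketch (NOT Spark's codegen) of the In vs. InSet trade-off:
// a linear scan with primitive equality versus a boxed hash-set lookup.
object InVsInSetSketch {

  // "In"-style membership: O(n) scan, primitive comparisons, no allocation.
  def inStyle(x: Long, values: Array[Long]): Boolean = {
    var i = 0
    while (i < values.length) {
      if (values(i) == x) return true
      i += 1
    }
    false
  }

  // "InSet"-style membership: O(1) lookup, but each probe boxes x
  // to java.lang.Long before calling Set[Any].contains.
  def inSetStyle(x: Long, values: Set[Any]): Boolean = values.contains(x)

  def main(args: Array[String]): Unit = {
    val items = (1L to 10L).toArray
    val set: Set[Any] = items.toSet[Any]
    // Both strategies must agree on membership; only their cost differs.
    assert(inStyle(7L, items) == inSetStyle(7L, set))
    assert(inStyle(42L, items) == inSetStyle(42L, set))
    println("membership results agree")
  }
}
```

For small literal lists the scan often wins despite its O(n) shape, which is consistent with the up-to-10x slowdown reported for InSet above.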
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718197#comment-16718197 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207

   Test FAILed.
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
   Test FAILed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718194#comment-16718194 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

SparkQA removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531

   **[Test build #3 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)** for PR 23291 at commit [`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718193#comment-16718193 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207

   Test FAILed.
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
   Test FAILed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718195#comment-16718195 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202

   Merged build finished. Test FAILed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718191#comment-16718191 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404185

   **[Test build #3 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)** for PR 23291 at commit [`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718177#comment-16718177 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600

   Test PASSed.
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/
   Test PASSed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718176#comment-16718176 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593

   Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718174#comment-16718174 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600

   Test PASSed.
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/
   Test PASSed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718173#comment-16718173 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593

   Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718172#comment-16718172 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531

   **[Test build #3 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)** for PR 23291 at commit [`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718056#comment-16718056 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240804589

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##########
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems bytes"
+    val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+    val df = spark.range(1, numRows).select($"id".cast(ByteType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems shorts"
+    val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+    val df = spark.range(1, numRows).select($"id".cast(ShortType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems longs"
+    val values = (1 to numItems).map(v => s"${v}L")
+    val df = spark.range(1, numRows).toDF("id")
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems floats"
+    val values = (1 to numItems).map(v => s"CAST($v AS float)")
+    val df = spark.range(1, numRows).select($"id".cast(FloatType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems doubles"
+    val values = 1.0 to numItems by 1.0
+    val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems small decimals"
+    val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems large decimals"
+    val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems strings"
+    val values = (1 to numItems).map(n => s"'$n'")
+    val df = spark.range(1,
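The benchmark quoted above relies on Spark's own harness (`org.apache.spark.benchmark.Benchmark`) and a live SparkSession. As a dependency-free approximation of what it measures, the sketch below times the two membership strategies directly on primitives. All names are illustrative, and the numbers it prints are not comparable to the real benchmark's results.

```scala
// Dependency-free sketch of the measurement: a linear scan over primitive
// longs ("In"-like) versus a boxed Set[Any] lookup ("InSet"-like), applied
// to many rows. Spark's real harness additionally handles warmup and
// minimum iteration counts, which this sketch omits.
object MembershipTiming {

  // Run body once and return (result, elapsed nanoseconds).
  def time[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val numItems = 25
    val numRows = 1000000L
    val values = (1L to numItems.toLong).toArray
    val set: Set[Any] = values.toSet[Any]

    // Count matches so the JIT cannot eliminate the loop bodies.
    val (inHits, inNanos) = time {
      var hits = 0L
      var row = 0L
      while (row < numRows) { if (values.contains(row % 100)) hits += 1; row += 1 }
      hits
    }
    val (inSetHits, inSetNanos) = time {
      var hits = 0L
      var row = 0L
      while (row < numRows) { if (set.contains(row % 100)) hits += 1; row += 1 }
      hits
    }

    assert(inHits == inSetHits) // both strategies must agree
    println(s"In-like: ${inNanos / 1000000} ms, InSet-like: ${inSetNanos / 1000000} ms")
  }
}
```

In the InSet-like loop every `row % 100` is boxed to `java.lang.Long` before the `contains` call, which is the overhead the JIRA description attributes the slowdown to.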
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718042#comment-16718042 ] ASF GitHub Bot commented on SPARK-26203: aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240800556 ## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.benchmark.Benchmark +import org.apache.spark.sql.catalyst.expressions.In +import org.apache.spark.sql.catalyst.expressions.InSet +import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark +import org.apache.spark.sql.functions.{array, struct} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ + +/** + * A benchmark that compares the performance of [[In]] and [[InSet]] expressions. + * + * To run this benchmark: + * {{{ + * 1. without sbt: bin/spark-submit --class + * 2. build/sbt "sql/test:runMain " + * 3. 
generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems bytes"
+    val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+    val df = spark.range(1, numRows).select($"id".cast(ByteType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems shorts"
+    val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+    val df = spark.range(1, numRows).select($"id".cast(ShortType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems longs"
+    val values = (1 to numItems).map(v => s"${v}L")
+    val df = spark.range(1, numRows).toDF("id")
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems floats"
+    val values = (1 to numItems).map(v => s"CAST($v AS float)")
+    val df = spark.range(1, numRows).select($"id".cast(FloatType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems doubles"
+    val values = 1.0 to numItems by 1.0
+    val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems small decimals"
+    val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems large decimals"
+    val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems strings"
+    val values = (1 to numItems).map(n => s"'$n'")
+    val df = spark.range(1,
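For context, the two expressions under test differ in lookup strategy: In scans its literal list linearly, while InSet probes a hash set. A minimal plain-Python analogy (not Spark code; the function names here are illustrative):

```python
# Analogy only: Spark's In/InSet operate on Catalyst expressions,
# not Python collections. Function names are illustrative.

def in_predicate(value, candidates):
    # "In": linear scan over the literal list -- O(n) per row.
    for c in candidates:
        if value == c:
            return True
    return False

def inset_predicate(value, candidate_set):
    # "InSet": a single hash lookup -- O(1) per row on average.
    return value in candidate_set

candidates = list(range(250))
candidate_set = frozenset(candidates)

rows = [3, 999, 249]
assert [in_predicate(r, candidates) for r in rows] == \
       [inset_predicate(r, candidate_set) for r in rows] == [True, False, True]
```

The benchmark's point is that the asymptotic win of the hash set can be eaten by per-row overhead (for example autoboxing in the JVM), which a plain complexity argument does not capture.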
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718019#comment-16718019 ] ASF GitHub Bot commented on SPARK-26203: aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240795529

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
## @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class
+ *   2. build/sbt "sql/test:runMain "
+ *   3.
generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */

+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)

Review comment: We can do this. Shall we rename `intBenchmark` to `runIntBenchmark` then? There is no consistency in existing benchmarks, unfortunately.
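The OptimizeIn rewrite described in this ticket can be sketched as a tiny rule. This is a hedged Python analogy, not the Catalyst rule itself: the threshold constant stands in for "spark.sql.optimizer.inSetConversionThreshold", and the value 10 is assumed purely for illustration.

```python
IN_SET_CONVERSION_THRESHOLD = 10  # stand-in for spark.sql.optimizer.inSetConversionThreshold

def optimize_in(values):
    """Mimic OptimizeIn: keep a list ("In") for short literal lists,
    switch to a hash set ("InSet") once the threshold is exceeded."""
    values = list(values)
    if len(values) > IN_SET_CONVERSION_THRESHOLD:
        return ("InSet", frozenset(values))
    return ("In", values)

kind_small, _ = optimize_in(range(5))
kind_large, _ = optimize_in(range(50))
assert (kind_small, kind_large) == ("In", "InSet")
```

The real rule additionally requires that all values be literals, since a hash set can only be prebuilt from constants.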
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718022#comment-16718022 ] ASF GitHub Bot commented on SPARK-26203: dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240796620

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)

Review comment: +1 for renaming. Yep. We overlooked the naming consistency in the previous benchmarks.
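The JIRA description asks for benchmarking every supported data type, and the per-type helpers in the patch (byteBenchmark, shortBenchmark, and so on) differ only in the cast expression and the benchmark name. A hedged Python sketch of that shape, mirroring only the naming scheme (s"$numItems bytes", s"$numItems shorts", ...), not Spark's benchmark API:

```python
# Illustrative only: the cast lambdas stand in for the SQL CASTs
# (CAST(v AS tinyint), CAST(v AS smallint), ...) used in the Scala helpers.

TYPE_CASES = {
    "bytes": lambda v: v % 128,      # stand-in for CAST(v AS tinyint)
    "shorts": lambda v: v % 32768,   # stand-in for CAST(v AS smallint)
    "ints": lambda v: v,             # identity, as in intBenchmark
}

def benchmark_name(num_items, type_name):
    # Mirrors the Scala naming scheme: s"$numItems bytes", etc.
    return f"{num_items} {type_name}"

names = [benchmark_name(250, t) for t in TYPE_CASES]
assert names == ["250 bytes", "250 shorts", "250 ints"]
```

Driving one generic helper from a table like this is one way to cut the repetition the reviewers note below each per-type method.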
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717886#comment-16717886 ] ASF GitHub Bot commented on SPARK-26203: dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240769809

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717878#comment-16717878 ] ASF GitHub Bot commented on SPARK-26203: dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240766306

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)

Review comment: nit. Shall we invoke `run` here instead of `intBenchmark(...).run()`? It seems to be repeated multiple times.
```
- benchmark(name, df, values, numRows, minNumIters)
+ benchmark(name, df, values, numRows, minNumIters).run()
```
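The refactor suggested above, moving run() into the helper so call sites no longer repeat intBenchmark(...).run(), can be sketched as follows. Benchmark here is a stand-in class, not Spark's org.apache.spark.benchmark.Benchmark, and runIntBenchmark is the hypothetical renamed helper from the review thread.

```python
class Benchmark:
    """Stand-in for org.apache.spark.benchmark.Benchmark (illustrative)."""
    def __init__(self, name):
        self.name = name
        self.ran = False

    def run(self):
        # In the real class this would execute the timed cases.
        self.ran = True
        return self

def run_int_benchmark(num_items):
    # The refactor under discussion: build AND run inside the helper,
    # so every call site shrinks from `intBenchmark(...).run()` to
    # `runIntBenchmark(...)`.
    return Benchmark(f"{num_items} ints").run()

b = run_int_benchmark(250)
assert b.ran and b.name == "250 ints"
```

The rename to a run-prefixed name then signals at the call site that the helper has the side effect of executing the benchmark, not just constructing it.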
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717879#comment-16717879 ] ASF GitHub Bot commented on SPARK-26203: dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240766592

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {

Review comment: `def` -> `private def`
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717865#comment-16717865 ] ASF GitHub Bot commented on SPARK-26203: aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240762538

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717877#comment-16717877 ] ASF GitHub Bot commented on SPARK-26203: aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240766003

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717869#comment-16717869 ]

ASF GitHub Bot commented on SPARK-26203:

dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240762863

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala

@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems bytes"
+    val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+    val df = spark.range(1, numRows).select($"id".cast(ByteType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems shorts"
+    val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+    val df = spark.range(1, numRows).select($"id".cast(ShortType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems longs"
+    val values = (1 to numItems).map(v => s"${v}L")
+    val df = spark.range(1, numRows).toDF("id")
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems floats"
+    val values = (1 to numItems).map(v => s"CAST($v AS float)")
+    val df = spark.range(1, numRows).select($"id".cast(FloatType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems doubles"
+    val values = 1.0 to numItems by 1.0
+    val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems small decimals"
+    val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems large decimals"
+    val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems strings"
+    val values = (1 to numItems).map(n => s"'$n'")
+    val df = spark.range(1,
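The quoted benchmark pits In (a linear scan over the candidate literals) against InSet (a set lookup over boxed values, which the JIRA description notes can be slower due to autoboxing). A minimal, self-contained sketch of that trade-off — the names here are illustrative only, not the actual Spark implementation:

```scala
object InVsInSetSketch {
  // Candidate literals, analogous to the benchmark's `values` list.
  val items: Seq[Long] = 1L to 200L

  // InSet-style structure: O(1) average lookups, but a generic Set
  // stores boxed java.lang.Long elements, so probes box on every call.
  val itemSet: Set[Long] = items.toSet

  // In-style membership: a linear O(n) scan over the literals.
  def inLinear(v: Long): Boolean = items.contains(v)

  // InSet-style membership: hash lookup, paying the boxing overhead.
  def inSet(v: Long): Boolean = itemSet.contains(v)

  def main(args: Array[String]): Unit = {
    assert(inLinear(5L) == inSet(5L))        // same answer either way
    assert(!inLinear(500L) && !inSet(500L))  // misses agree too
    println("ok")
  }
}
```

The hash lookup wins asymptotically as the literal list grows, but each `inSet` probe boxes `v` into a `java.lang.Long` before hashing — per-row overhead of exactly the kind this benchmark is designed to expose.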
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717866#comment-16717866 ]

ASF GitHub Bot commented on SPARK-26203:

aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240762608

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717861#comment-16717861 ]

ASF GitHub Bot commented on SPARK-26203:

dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240761247

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717859#comment-16717859 ]

ASF GitHub Bot commented on SPARK-26203:

dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240760944

## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717807#comment-16717807 ]

ASF GitHub Bot commented on SPARK-26203:

AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446326286

Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717837#comment-16717837 ]

ASF GitHub Bot commented on SPARK-26203:

AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328453

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99987/
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717831#comment-16717831 ]

ASF GitHub Bot commented on SPARK-26203:

SparkQA removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446324303

**[Test build #99987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99987/testReport)** for PR 23291 at commit [`6a5e992`](https://github.com/apache/spark/commit/6a5e992a3f940b7e85d11c5964086e3a6185e81c).
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717832#comment-16717832 ]

ASF GitHub Bot commented on SPARK-26203:

AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328447

Merged build finished. Test FAILed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717828#comment-16717828 ]

ASF GitHub Bot commented on SPARK-26203:

AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328447

Merged build finished. Test FAILed.
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717827#comment-16717827 ] ASF GitHub Bot commented on SPARK-26203: SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446328419 **[Test build #99987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99987/testReport)** for PR 23291 at commit [`6a5e992`](https://github.com/apache/spark/commit/6a5e992a3f940b7e85d11c5964086e3a6185e81c). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> Benchmark performance of In and InSet expressions
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Anton Okolnychyi
> Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons, to avoid O\(n\) time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then (e.g., generation of Java code to evaluate expressions), so it is worth measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions.
> Based on my preliminary investigation, there are a number of possible optimizations, and they frequently depend on the specific data type.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
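The tradeoff the issue describes — a linear scan for {{In}} versus a hash-set probe for {{InSet}} — can be sketched in plain Java. Everything below (class and method names, the threshold constant) is illustrative only, not Spark's actual implementation; note how the set probe must box the primitive value, which is the autoboxing cost mentioned above.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class OptimizeInSketch {
    // Illustrative stand-in for spark.sql.optimizer.inSetConversionThreshold.
    static final int IN_SET_CONVERSION_THRESHOLD = 10;

    // What In does conceptually: an O(n) chain of primitive comparisons, no boxing.
    static boolean evalIn(long value, long[] literals) {
        for (long lit : literals) {
            if (value == lit) return true;
        }
        return false;
    }

    // What InSet does conceptually: an O(1) hash probe,
    // but `value` is boxed to a Long before the lookup.
    static boolean evalInSet(long value, Set<Long> literals) {
        return literals.contains(value);
    }

    public static void main(String[] args) {
        long[] lits = {1L, 2L, 3L, 42L};
        Set<Long> set = Arrays.stream(lits).boxed().collect(Collectors.toSet());

        // The optimizer would pick InSet only above the threshold.
        boolean useInSet = lits.length > IN_SET_CONVERSION_THRESHOLD;
        System.out.println(useInSet ? evalInSet(42L, set) : evalIn(42L, lits)); // true
    }
}
```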
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717829#comment-16717829 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446328453 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99987/
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717783#comment-16717783 ] ASF GitHub Bot commented on SPARK-26203: dongjoon-hyun commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446325194 Thank you for making the benchmark, @aokolnychyi. I'll take a look.
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717808#comment-16717808 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446326295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5985/
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717809#comment-16717809 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446326286 Merged build finished. Test PASSed.
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717810#comment-16717810 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446326295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5985/
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717779#comment-16717779 ] ASF GitHub Bot commented on SPARK-26203: SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446324303 **[Test build #99987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99987/testReport)** for PR 23291 at commit [`6a5e992`](https://github.com/apache/spark/commit/6a5e992a3f940b7e85d11c5964086e3a6185e81c).
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717776#comment-16717776 ] ASF GitHub Bot commented on SPARK-26203: aokolnychyi opened a new pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291

## What changes were proposed in this pull request?

This PR contains benchmarks for `In` and `InSet` expressions. They cover literals of different data types and will help us decide where to integrate the switch-based logic for bytes/shorts/ints. As discussed in [PR-23171](https://github.com/apache/spark/pull/23171), one potential approach is to convert `In` to `InSet` whenever all elements are literals, independently of data type and the number of elements. According to the results in this PR, we might want to keep the threshold on the number of elements: the if-else approach might be faster for some data types with a small number of elements (structs? arrays? small decimals?). The execution time for all benchmarks is around 4 minutes.

### byte / short / int / long

Unless the number of items is really big, `InSet` is slower than `In` because of autoboxing. Interestingly, `In` scales worse on bytes/shorts than on ints/longs. For example, `InSet` starts to match the performance of `In` at around 50 bytes/shorts, while this does not happen at the same number of ints/longs. This is a bit strange, as shorts/bytes (e.g., `(byte) 1`, `(short) 2`) are represented as ints in the bytecode.

### float / double

Use cases on floats/doubles also suffer from autoboxing. Therefore, `In` outperforms `InSet` on 10 elements. Similarly to shorts/bytes, `In` scales worse on floats/doubles than on ints/longs because the equality condition is more complicated (e.g., `(java.lang.Float.isNaN(filter_valueArg_0) && java.lang.Float.isNaN(9.0F)) || filter_valueArg_0 == 9.0F`).
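The generated float comparison quoted above has to special-case NaN, since in Java `NaN == NaN` is false. A standalone rendition of that condition might look like this (the class and method names are hypothetical, not Spark's):

```java
public class FloatEqualsSketch {
    // Mirrors the generated condition quoted above: two floats match if both
    // are NaN, or if they compare equal with ==. Plain == alone would make a
    // NaN literal in the IN-list unmatchable, because NaN == NaN is false.
    static boolean floatEquals(float value, float literal) {
        return (Float.isNaN(value) && Float.isNaN(literal)) || value == literal;
    }

    public static void main(String[] args) {
        System.out.println(floatEquals(9.0f, 9.0f));           // true
        System.out.println(Float.NaN == Float.NaN);            // false: why the extra check exists
        System.out.println(floatEquals(Float.NaN, Float.NaN)); // true
    }
}
```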
### decimal

The reason why we have separate benchmarks for small and large decimals is that Spark might use longs to represent decimals in some cases. If this optimization happens, then `equals` reduces to comparing longs. If it does not, Spark will create an instance of `scala.BigDecimal` and use it for comparisons, which is more expensive. `Decimal$hashCode` will always use `scala.BigDecimal$hashCode` even if the number is small enough to fit into a long variable. As a consequence, we see that use cases on small decimals are faster with `In`, as they rely on long comparisons under the hood. Large decimal values are always faster with `InSet`.

### string

`UTF8String$equals` is not cheap. Therefore, `In` does not really outperform `InSet` as in the previous use cases.

### timestamp / date

Under the hood, timestamp/date values are represented as long/int values, so `In` allows us to avoid autoboxing.

### array

Arrays work as expected: `In` is faster on 5 elements, while `InSet` is faster on 15 elements. The benchmarks use `UnsafeArrayData`.

### struct

`InSet` is always faster than `In` for structs. These benchmarks use `GenericInternalRow`.
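The two decimal comparison paths described in the decimal section above can be illustrated with a simplified sketch. The 18-digit cutoff (the largest precision whose unscaled value is guaranteed to fit in a long) and all names here are illustrative assumptions, not Spark's actual `Decimal` implementation:

```java
import java.math.BigDecimal;

public class DecimalEqualsSketch {
    // Assumed cutoff: an unscaled value with at most 18 decimal digits fits in a long.
    static final int MAX_LONG_DIGITS = 18;

    // Fast path: when both unscaled values fit in a long and the scales match,
    // equality is a primitive long comparison. Otherwise fall back to the
    // much more expensive BigDecimal comparison.
    static boolean decimalEquals(BigDecimal a, BigDecimal b) {
        if (a.scale() == b.scale()
                && a.precision() <= MAX_LONG_DIGITS
                && b.precision() <= MAX_LONG_DIGITS) {
            return a.unscaledValue().longValueExact() == b.unscaledValue().longValueExact();
        }
        return a.compareTo(b) == 0;
    }

    public static void main(String[] args) {
        // Small decimals take the long path; the 21-digit values take the slow path.
        System.out.println(decimalEquals(new BigDecimal("1.23"),
                                         new BigDecimal("1.23"))); // true
        System.out.println(decimalEquals(new BigDecimal("12345678901234567890.5"),
                                         new BigDecimal("12345678901234567890.5"))); // true
    }
}
```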