[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718192#comment-16718192
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202
 
 
   Merged build finished. Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons, to avoid the O(n) time 
> complexity of {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since 
> then (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, there are quite a few possible 
> optimizations, many of which depend on the specific data type.
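The threshold-based conversion described above can be sketched as follows. This is a hypothetical, simplified model for illustration only (the object and method names are assumptions, not Spark APIs) — Spark's real {{OptimizeIn}} rule rewrites Catalyst expression trees rather than building closures:

```scala
// Hypothetical, simplified model of the OptimizeIn idea: below the threshold,
// membership is a linear scan over the literals (what In does); above it, the
// literals are collected once into a Set that is probed per row (what InSet does).
object InVsInSetSketch {

  // Mirrors the role of spark.sql.optimizer.inSetConversionThreshold
  // (default 10 at the time of this JIRA).
  val conversionThreshold = 10

  def buildPredicate(values: Seq[Long]): Long => Boolean = {
    if (values.size > conversionThreshold) {
      val set = values.toSet        // built once; expected O(1) lookups
      v => set.contains(v)
    } else {
      v => values.exists(_ == v)    // O(n) scan, but no Set overhead
    }
  }

  def main(args: Array[String]): Unit = {
    val small = buildPredicate((1L to 5L).toSeq)    // stays a linear scan
    val large = buildPredicate((1L to 100L).toSeq)  // converted to a Set probe
    println(small(3L))
    println(large(101L))
  }
}
```

The trade-off this JIRA questions is exactly the one visible in the sketch: the Set probe wins asymptotically, but its constant factors (boxing, hashing) can dominate for the value counts seen in practice.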



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718197#comment-16718197
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
   Test FAILed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718194#comment-16718194
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531
 
 
   **[Test build #3 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)**
 for PR 23291 at commit 
[`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718193#comment-16718193
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
   Test FAILed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718195#comment-16718195
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202
 
 
   Merged build finished. Test FAILed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718191#comment-16718191
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of 
In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404185
 
 
   **[Test build #3 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)**
 for PR 23291 at commit 
[`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718177#comment-16718177
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/
   Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718176#comment-16718176
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593
 
 
   Merged build finished. Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718174#comment-16718174
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/
   Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718173#comment-16718173
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593
 
 
   Merged build finished. Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718172#comment-16718172
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of 
In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531
 
 
   **[Test build #3 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)**
 for PR 23291 at commit 
[`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718056#comment-16718056
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240804589
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class  
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems bytes"
+    val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+    val df = spark.range(1, numRows).select($"id".cast(ByteType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems shorts"
+    val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+    val df = spark.range(1, numRows).select($"id".cast(ShortType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems longs"
+    val values = (1 to numItems).map(v => s"${v}L")
+    val df = spark.range(1, numRows).toDF("id")
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems floats"
+    val values = (1 to numItems).map(v => s"CAST($v AS float)")
+    val df = spark.range(1, numRows).select($"id".cast(FloatType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems doubles"
+    val values = 1.0 to numItems by 1.0
+    val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems small decimals"
+    val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems large decimals"
+    val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems strings"
+    val values = (1 to numItems).map(n => s"'$n'")
+    val df = spark.range(1, 
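One plausible source of the "up to 10x slower" result mentioned in the issue description can be shown with a standalone micro-sketch. This is an assumption-laden illustration (the object and helper names are hypothetical, and it is not part of the PR above): a generic `scala.collection.immutable.Set[Long]` boxes every primitive probe, while a hand-rolled scan over an `Array[Long]` stays unboxed.

```scala
// Hypothetical micro-sketch, not part of the PR: compares an unboxed linear
// scan (analogous to In over literals) against a generic Set probe (analogous
// to InSet), where each Set.contains call boxes the primitive Long argument.
object AutoboxingSketch {

  // Unboxed linear membership test over primitive longs.
  def linearContains(arr: Array[Long], v: Long): Boolean = {
    var i = 0
    while (i < arr.length) {
      if (arr(i) == v) return true
      i += 1
    }
    false
  }

  def main(args: Array[String]): Unit = {
    val items: Array[Long] = (1L to 20L).toArray
    val set: Set[Long] = items.toSet  // generic Set: every contains() boxes
    val probes = 5000000L

    def time(label: String)(body: => Unit): Unit = {
      val start = System.nanoTime()
      body
      println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
    }

    time("linear scan over Array[Long]") {
      var i = 0L; var hits = 0L
      while (i < probes) { if (linearContains(items, i % 40)) hits += 1; i += 1 }
    }
    time("Set[Long].contains (autoboxing)") {
      var i = 0L; var hits = 0L
      while (i < probes) { if (set.contains(i % 40)) hits += 1; i += 1 }
    }
  }
}
```

Actual timings depend on the JVM, list size, and data type, which is exactly why the benchmark in this PR measures each supported type separately.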

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718042#comment-16718042
 ] 

ASF GitHub Bot commented on SPARK-26203:


aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240800556
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class  
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *  Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems bytes"
+    val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+    val df = spark.range(1, numRows).select($"id".cast(ByteType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems shorts"
+    val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+    val df = spark.range(1, numRows).select($"id".cast(ShortType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems longs"
+    val values = (1 to numItems).map(v => s"${v}L")
+    val df = spark.range(1, numRows).toDF("id")
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems floats"
+    val values = (1 to numItems).map(v => s"CAST($v AS float)")
+    val df = spark.range(1, numRows).select($"id".cast(FloatType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems doubles"
+    val values = 1.0 to numItems by 1.0
+    val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems small decimals"
+    val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems large decimals"
+    val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems strings"
+    val values = (1 to numItems).map(n => s"'$n'")
+val df = spark.range(1, 

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718019#comment-16718019
 ] 

ASF GitHub Bot commented on SPARK-26203:


aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240795529
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
 
 Review comment:
   We can do this. Shall we rename `intBenchmark` to `runIntBenchmark` then? 
There is no consistency in existing benchmarks, unfortunately. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -------------------------------------------------
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} when the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons, to avoid the O(n) time complexity of {{In}}.
> The original optimization was introduced in SPARK-3711. A lot has changed since then (e.g., Java code generation for expression evaluation), so it is worth measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type with {{In}} and {{InSet}} and outline the existing bottlenecks. Once we have this information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible, and many of them depend on the specific data type.
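The trade-off described in the issue can be sketched without Spark. The names below (`evalIn`, `evalInSet`) are illustrative stand-ins, not Spark APIs: `In` amounts to a linear scan over the literal list per row, while `InSet` precomputes a hash set, but on primitive inputs the set lookup boxes the value, which is one of the overheads this benchmark is meant to expose.

```scala
// Illustrative only (not Spark code): contrasts the per-row linear scan
// behind In with the precomputed hash-set lookup behind InSet.
object InVsInSetSketch {
  val items: Seq[Int] = 1 to 200          // literals in `x IN (...)`
  val itemSet: Set[Int] = items.toSet     // what the InSet conversion precomputes

  def evalIn(x: Int): Boolean = items.contains(x)      // O(n) per row
  def evalInSet(x: Int): Boolean = itemSet.contains(x) // O(1) expected, but boxes x

  def main(args: Array[String]): Unit = {
    // Both paths must agree on every input; only their cost profiles differ.
    assert(evalIn(150) == evalInSet(150))
    assert(evalIn(1000) == evalInSet(1000))
  }
}
```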




[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718022#comment-16718022
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240796620
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
 
 Review comment:
   +1 for renaming. Yep. We overlooked the naming consistency in the previous 
benchmarks.







--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717886#comment-16717886
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240769809
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717878#comment-16717878
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240766306
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
 
 Review comment:
   nit. Shall we invoke `run` here instead of `intBenchmark(...).run()`? It seems to be repeated multiple times.
   ```
   - benchmark(name, df, values, numRows, minNumIters)
   + benchmark(name, df, values, numRows, minNumIters).run()
   ```
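A minimal sketch of the shape this suggestion leads to, alongside the `runIntBenchmark` naming discussed in this thread. The shared `benchmark(...)` helper and Spark's `Benchmark` class are not shown in the diff, so a stub stands in for them here.

```scala
// Stub stand-in for org.apache.spark.benchmark.Benchmark, which the PR's
// helpers return; only the run() entry point matters for this sketch.
class Benchmark(val name: String) {
  def run(): Unit = println(s"running: $name")
}

object BenchmarkShapeSketch {
  // Hypothetical stand-in for the PR's shared benchmark(...) helper.
  private def benchmark(name: String): Benchmark = new Benchmark(name)

  // Before: returns a Benchmark, so every call site must remember .run().
  def intBenchmark(numItems: Int): Benchmark =
    benchmark(s"$numItems ints")

  // After: runs in place (returning Unit), matching the proposed rename.
  def runIntBenchmark(numItems: Int): Unit =
    benchmark(s"$numItems ints").run()

  def main(args: Array[String]): Unit = {
    intBenchmark(10).run() // old call shape
    runIntBenchmark(10)    // new call shape
  }
}
```

The second shape keeps the repeated `.run()` out of the object's `main`, at the cost of no longer exposing the `Benchmark` instance to callers.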





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717879#comment-16717879
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240766592
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
 
 Review comment:
   `def` -> `private def`








-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717865#comment-16717865
 ] 

ASF GitHub Bot commented on SPARK-26203:


aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240762538
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717877#comment-16717877
 ] 

ASF GitHub Bot commented on SPARK-26203:


aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240766003
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class  
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *      Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems bytes"
+    val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+    val df = spark.range(1, numRows).select($"id".cast(ByteType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems shorts"
+    val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+    val df = spark.range(1, numRows).select($"id".cast(ShortType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems ints"
+    val values = 1 to numItems
+    val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems longs"
+    val values = (1 to numItems).map(v => s"${v}L")
+    val df = spark.range(1, numRows).toDF("id")
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems floats"
+    val values = (1 to numItems).map(v => s"CAST($v AS float)")
+    val df = spark.range(1, numRows).select($"id".cast(FloatType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems doubles"
+    val values = 1.0 to numItems by 1.0
+    val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems small decimals"
+    val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems large decimals"
+    val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+    val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+    benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = {
+    val name = s"$numItems strings"
+    val values = (1 to numItems).map(n => s"'$n'")
+    val df = spark.range(1,

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717869#comment-16717869
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240762863
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717866#comment-16717866
 ] 

ASF GitHub Bot commented on SPARK-26203:


aokolnychyi commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240762608
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717861#comment-16717861
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240761247
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717859#comment-16717859
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240760944
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##

[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717807#comment-16717807
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446326286
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since 
> then (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, we can apply a number of 
> optimizations, many of which depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
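The trade-off the quoted issue describes (a linear scan for {{In}} versus a pre-built hash set for {{InSet}}, with autoboxing overhead for primitive values) can be sketched in plain Scala. This is an illustrative sketch only: it uses no Spark APIs, and all names in it are made up.

```scala
import scala.collection.immutable.HashSet

object InVsInSetSketch {
  // "In"-style evaluation: a linear scan over the candidate values,
  // O(n) comparisons per input row.
  def inLinear(value: Int, candidates: Seq[Int]): Boolean =
    candidates.exists(_ == value)

  // "InSet"-style evaluation: the candidates are pre-built into a hash set,
  // giving O(1) expected lookups. Holding primitives in a Set[Any] forces
  // autoboxing on every lookup, one of the costs the issue mentions.
  def inSetBoxed(value: Int, candidates: Set[Any]): Boolean =
    candidates.contains(value)

  def main(args: Array[String]): Unit = {
    val candidates = 1 to 300
    val boxed: Set[Any] = HashSet[Any](candidates: _*)
    println(inLinear(250, candidates))   // true
    println(inSetBoxed(250, boxed))      // true
    println(inSetBoxed(1000, boxed))     // false
  }
}
```

The hashed lookup wins asymptotically, but each `inSetBoxed` call boxes its `Int` argument before probing the set; for small candidate lists the boxing and hashing overhead can outweigh a branch-predictable linear scan, which is consistent with the slowdowns the issue reports.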



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717837#comment-16717837
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328453
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99987/
   Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717831#comment-16717831
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446324303
 
 
   **[Test build #99987 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99987/testReport)**
 for PR 23291 at commit 
[`6a5e992`](https://github.com/apache/spark/commit/6a5e992a3f940b7e85d11c5964086e3a6185e81c).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717832#comment-16717832
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328447
 
 
   Merged build finished. Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717828#comment-16717828
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328447
 
 
   Merged build finished. Test FAILed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717827#comment-16717827
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of 
In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328419
 
 
   **[Test build #99987 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99987/testReport)**
 for PR 23291 at commit 
[`6a5e992`](https://github.com/apache/spark/commit/6a5e992a3f940b7e85d11c5964086e3a6185e81c).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717829#comment-16717829
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446328453
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99987/
   Test FAILed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717783#comment-16717783
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446325194
 
 
   Thank you for making the benchmark, @aokolnychyi . I'll take a look.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717808#comment-16717808
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446326295
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5985/
   Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717809#comment-16717809
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446326286
 
 
   Merged build finished. Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717810#comment-16717810
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446326295
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5985/
   Test PASSed.





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717779#comment-16717779
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of 
In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446324303
 
 
   **[Test build #99987 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99987/testReport)**
 for PR 23291 at commit 
[`6a5e992`](https://github.com/apache/spark/commit/6a5e992a3f940b7e85d11c5964086e3a6185e81c).





[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717776#comment-16717776
 ] 

ASF GitHub Bot commented on SPARK-26203:


aokolnychyi opened a new pull request #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291
 
 
   ## What changes were proposed in this pull request?
   
   This PR contains benchmarks for `In` and `InSet` expressions. They cover 
literals of different data types and will help us to decide where to integrate 
the switch-based logic for bytes/shorts/ints.
   
   As discussed in [PR-23171](https://github.com/apache/spark/pull/23171), one 
potential approach is to convert `In` to `InSet` whenever all elements are 
literals, regardless of data type and the number of elements. According to the 
results of this PR, we might want to keep the threshold on the number of 
elements: the if-else approach might still be faster for some data types on a 
small number of elements (structs? arrays? small decimals?).
   
   The execution time for all benchmarks is around 4 minutes.
   
   ### byte / short / int / long
   
   Unless the number of items is really big, `InSet` is slower than `In` 
because of autoboxing.
   
   Interestingly, `In` scales worse on bytes/shorts than on ints/longs. For 
example, `InSet` starts to match the performance on around 50 bytes/shorts 
while this does not happen on the same number of ints/longs. This is a bit 
strange as shorts/bytes (e.g., `(byte) 1`, `(short) 2`) are represented as ints 
in the bytecode.
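The switch-based logic mentioned earlier for bytes/shorts/ints can be illustrated with a hedged sketch (`inSwitch` is a hypothetical example, not Spark's codegen output): `javac` compiles a switch over int-like values to a `tableswitch`/`lookupswitch` instruction, so membership is decided without boxing or hashing.

```java
public class SwitchInSketch {
    // Membership test over int-like literals compiled to a JVM switch:
    // no Integer allocation, no hashCode() call, just a jump table.
    static boolean inSwitch(int value) {
        switch (value) {
            case 1:
            case 5:
            case 9:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inSwitch(5)); // true
        System.out.println(inSwitch(4)); // false
    }
}
```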
   
   ### float / double
   
   Use cases on floats/doubles also suffer from autoboxing. Therefore, `In` 
outperforms `InSet` on 10 elements.
   
   Similarly to shorts/bytes, `In` scales worse on floats/doubles than on 
ints/longs because the equality condition is more complicated (e.g., 
`java.lang.Float.isNaN(filter_valueArg_0) && java.lang.Float.isNaN(9.0F)) || 
filter_valueArg_0 == 9.0F`). 
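The NaN-aware condition quoted above can be reproduced in plain Java to show why it is needed (`floatEquals` is an illustrative helper, not Spark code): `==` is false for NaN operands, while boxed `Float` values compare by bit pattern.

```java
public class NanEqualitySketch {
    // Mirrors the generated condition: NaN must be special-cased because
    // IEEE 754 defines NaN == NaN as false.
    static boolean floatEquals(float a, float b) {
        return (Float.isNaN(a) && Float.isNaN(b)) || a == b;
    }

    public static void main(String[] args) {
        System.out.println(Float.NaN == Float.NaN);            // false
        System.out.println(floatEquals(Float.NaN, Float.NaN)); // true
        // Boxed floats compare via floatToIntBits, so a HashSet<Float>
        // handles NaN correctly but pays boxing on every probe.
        System.out.println(Float.valueOf(Float.NaN).equals(Float.NaN)); // true
    }
}
```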
   
   ### decimal
   
   The reason why we have separate benchmarks for small and large decimals is 
that Spark might use longs to represent decimals in some cases.
   
   If this optimization happens, then `equals` is nothing more than comparing 
longs. If it does not, Spark will create an instance of `scala.BigDecimal` and 
use it for comparisons, which is more expensive.
   
   `Decimal$hashCode` will always use `scala.BigDecimal$hashCode` even if the 
number is small enough to fit into a long variable. As a consequence, we see 
that use cases on small decimals are faster with `In` as they are using long 
comparisons under the hood. Large decimal values are always faster with `InSet`.
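A hedged illustration of the two comparison paths (the long-backed case is simulated with plain longs; `java.math.BigDecimal`, which `scala.BigDecimal` wraps, stands in for the object path):

```java
import java.math.BigDecimal;

public class DecimalCompareSketch {
    public static void main(String[] args) {
        // Long-backed path: a decimal such as 12.34 at scale 2 can be held
        // as the unscaled long 1234, so equality is one primitive compare.
        long a = 1234, b = 1234;
        System.out.println(a == b); // true

        // Object path: BigDecimal.equals() also compares scale, so
        // "1.0" and "1.00" are not equals()-equal; compareTo() ignores scale.
        BigDecimal x = new BigDecimal("1.0");
        BigDecimal y = new BigDecimal("1.00");
        System.out.println(x.equals(y));         // false
        System.out.println(x.compareTo(y) == 0); // true
    }
}
```

The object path allocates and walks digit arrays, which is the cost the benchmarks attribute to large decimals.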
   
   ### string
   
   `UTF8String$equals` is not cheap. Therefore, `In` does not really outperform 
`InSet` as in previous use cases.
   
   ### timestamp / date
   
   Under the hood, timestamp/date values will be represented as long/int 
values. So, `In` allows us to avoid autoboxing.
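A small sketch of why date comparisons reduce to primitive equality (using `java.time` to compute an epoch-day int; this mirrors, but is not, Spark's internal representation):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DateAsIntSketch {
    public static void main(String[] args) {
        // A date held as an int count of days since 1970-01-01:
        // comparing two dates is then a single int comparison, no boxing.
        int epochDay = (int) LocalDate.of(2018, 12, 11).toEpochDay();
        int candidate = (int) ChronoUnit.DAYS.between(
                LocalDate.ofEpochDay(0), LocalDate.of(2018, 12, 11));
        System.out.println(epochDay == candidate); // true
    }
}
```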
   
   ### array
   
   Arrays are working as expected. `In` is faster on 5 elements while `InSet` 
is faster on 15 elements. The benchmarks are using `UnsafeArrayData`. 
   
   ### struct
   
   `InSet` is always faster than `In` for structs. These benchmarks use 
`GenericInternalRow`.

