[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 As @rxin said, if we introduce a separate expression for the switch-based approach, we will need to modify other places, for example `DataSourceStrategy$translateFilter`. So integrating into `In` or `InSet` seems safer. I think we can move the switch-based logic to `InSet` and make `InSet` responsible for picking the optimal execution path. We might need to modify the condition under which we convert `In` to `InSet`, as it will most likely depend on the underlying data type. I saw noticeable improvements starting from 5 elements when comparing the if-else approach to the switch-based one; right now, the conversion happens only for more than 10 elements. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 Basically, logically there are only two expressions: `In`, which handles arbitrary expressions, and `InSet`, which handles expressions with literals. Both could work: (1) we provide two separate expressions for `InSet`, one using switch and one using a hash set, or (2) we provide one `InSet` and internally have two implementations. The downside of creating different expressions for the same logical expression is that the downstream optimization rules would potentially need to match more.
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171 @rxin `switch` in Java is still significantly faster than a hash set, even without boxing/unboxing problems, when the number of elements is small. We were thinking about having two implementations in `InSet`, and picking `switch` if the number of elements is small, or otherwise the hash set one. But this is the same complexity as having two implementations in `In`, as in this PR. @cloud-fan do you suggest creating an `OptimizeIn` which has `switch` and hash set implementations based on the length of the elements, and removing `InSet`? Basically, what we were thinking above.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 How about we create an `OptimizedIn` and convert `In` to `OptimizedIn` if the list is all literals? `OptimizedIn` will pick `switch` or a hash set based on the length of the list.
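A minimal sketch of the two paths such a dispatch would choose between, assuming a made-up class name and the literal list {1, 5, 9} (illustrative only, not Spark's actual `OptimizedIn`): for a small literal list the generated Java would be a `switch` over constants, while a longer list falls back to a hash set probe.

```java
// Hypothetical sketch, not Spark code: two membership paths for {1, 5, 9}.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OptimizedInSketch {
    // Small-list path: the shape of code a generator could emit as a
    // switch over int literals (a constant-time tableswitch/lookupswitch).
    static boolean inSwitch(int v) {
        switch (v) {
            case 1:
            case 5:
            case 9:
                return true;
            default:
                return false;
        }
    }

    // Large-list path: hash set probe (boxed in Spark's current InSet).
    static final Set<Integer> SET = new HashSet<>(Arrays.asList(1, 5, 9));

    static boolean inSet(int v) {
        return SET.contains(v);
    }
}
```

The real decision point — the list-length threshold, and whether it should vary per data type — is exactly what the benchmarks discussed in this thread would have to settle.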
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 I thought `InSwitch` is logically the same as `InSet`, in which all the child expressions are literals?
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 I think `InSet` is not an optimized version of `In`, but just a way to separate the implementation for different conditions (the length of the list). Maybe we should do the same thing here: create an `InSwitch` and convert `In` to it when certain conditions are met. One problem is that `In` and `InSwitch` are the same in the interpreted version; maybe we should create a base class for them.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 That probably means we should just optimize `InSet` to have the switch version, though? Rather than do it in `In`?
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 @rxin I proposed the same thing before, but one problem is that we only convert `In` to `InSet` when the length of the list reaches the threshold. If the `switch` way is faster than a hash set when the list is small, it still seems worth optimizing `In` using `switch`.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 I'm not a big fan of making the physical implementation of an expression very different depending on the situation. Why can't we just make `InSet` efficient and convert these cases to that?
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23171 yes @aokolnychyi, I agree that the work can be done later (not in the scope of this PR). We can maybe just open a new JIRA about it so we won't forget.
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 To sum up, I would say the goal of this PR is to make `In` expressions as efficient as possible for bytes/shorts/ints. Then we can benchmark `In` vs `InSet` in [SPARK-26203](https://issues.apache.org/jira/browse/SPARK-26203) and try to come up with a solution for `InSet` in [SPARK-26204](https://issues.apache.org/jira/browse/SPARK-26204). By the time we solve [SPARK-26204](https://issues.apache.org/jira/browse/SPARK-26204), we will have a clear understanding of the pros and cons of `In` and `InSet` and will be able to determine the right thresholds.
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 @dbtsai @mgaido91 I think we can come back to this question once [SPARK-26203](https://issues.apache.org/jira/browse/SPARK-26203) is resolved. That JIRA will give us enough information about each data type.
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23171 @dbtsai I see; it would be great, though, to check what this threshold is. My understanding is that the current solution has better performance even for several hundreds of items. If this number is in the thousands, and since it depends on the data type (so it is hard for users to control with a single config), it is arguable which solution is best: I don't think it is very common to have thousands of elements, while for lower (more common) numbers we would use the less efficient solution.
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171 @cloud-fan as @aokolnychyi said, `switch` will still be faster than an optimized `Set` without autoboxing when the number of elements is small. As a result, this PR is still very useful. @mgaido91 `InSet` can be better, when implemented properly without autoboxing, for large numbers of elements controlled by `spark.sql.optimizer.inSetConversionThreshold`. Also, generating `In` with huge lists can cause a compile exception due to the method size limit, as you pointed out. As a result, we should convert to `InSet` for large sets.
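To illustrate the autoboxing point (a sketch with made-up values, not Spark code): probing a `HashSet<Integer>` with a primitive `int` goes through `Integer.valueOf` on every call, which allocates for values outside the `[-128, 127]` cache — overhead the `switch` path avoids entirely.

```java
import java.util.HashSet;
import java.util.Set;

public class BoxingProbe {
    // A boxed set of multiples of 3 below 1000 (arbitrary example data).
    static final Set<Integer> SET = new HashSet<>();
    static {
        for (int i = 0; i < 1000; i += 3) SET.add(i); // add() boxes each value
    }

    static boolean probe(int v) {
        return SET.contains(v); // implicit Integer.valueOf(v) on every probe
    }
}
```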
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 @cloud-fan, yeah, let's see if this PR is useful. The original idea wasn't to avoid fixing autoboxing in `InSet`. `In` was tested on 250 numbers to prove O(1) time complexity on compact values and outline problems with `InSet`. After this change, `In` will be faster than `InSet`, but this is not the goal. Overall, the intent was to have a tiny and straightforward change that would optimize `In` expressions even if the number of elements is less than `spark.sql.optimizer.inSetConversionThreshold` and Spark does not use `InSet`. Once we solve the autoboxing issues in `InSet`, we would need to benchmark against this approach.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 I'm wondering if this is still useful after we fix the boxing issue in `InSet`. We can write a binary hash set for primitive types, like `LongToUnsafeRowMap`, which should have better performance.
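A boxing-free primitive hash set can be sketched as open addressing over a `long[]` (illustrative only; Spark's `LongToUnsafeRowMap` is a far more elaborate structure and this is not its API). The sketch assumes a capacity fixed up front, which fits a literal list known at planning time — no resizing or removal.

```java
// Minimal open-addressing hash set for primitive longs; hypothetical class.
public class LongOpenHashSet {
    private final long[] keys;
    private final boolean[] used;
    private final int mask;

    public LongOpenHashSet(int expected) {
        // Power-of-two capacity at roughly 50% load; fixed for the set's
        // lifetime (assumes at most `expected` distinct keys are added).
        int cap = Integer.highestOneBit(Math.max(4, expected * 2) - 1) << 1;
        keys = new long[cap];
        used = new boolean[cap];
        mask = cap - 1;
    }

    private int slot(long k) {
        // Cheap mix; a production version would use a stronger hash.
        int h = (int) (k ^ (k >>> 32));
        return (h * 0x9E3779B9) & mask;
    }

    public void add(long k) {
        int i = slot(k);
        while (used[i]) {
            if (keys[i] == k) return;      // already present
            i = (i + 1) & mask;            // linear probing
        }
        used[i] = true;
        keys[i] = k;
    }

    public boolean contains(long k) {
        int i = slot(k);
        while (used[i]) {
            if (keys[i] == k) return true;
            i = (i + 1) & mask;
        }
        return false;                      // hit an empty slot: absent
    }
}
```

Probes touch only primitive arrays, so there is no `Integer`/`Long` allocation on the hot path.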
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23171 Also cc @ueshin
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99393/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23171 **[Test build #99393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)** for PR 23171 at commit [`1477f10`](https://github.com/apache/spark/commit/1477f101951ed39b96f884ddd2d451e164c0cc43). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171 The approach looks great and can significantly improve performance. For Long, I agree that we should also implement a binary search approach for `O(log n)` lookup. I wonder which one will be faster: binary search using arrays, or rewriting the `if-else` chain in binary search form.
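The array-based variant can be sketched like this (illustrative, not Spark code): keep the literals in a sorted primitive `long[]` and probe with `java.util.Arrays.binarySearch`, which is `O(log n)` with no boxing. Whether this beats an `if-else` chain laid out in binary-search form is exactly the benchmarking question raised here.

```java
import java.util.Arrays;

public class LongInBinarySearch {
    private final long[] sorted;

    public LongInBinarySearch(long[] literals) {
        // Sort once up front (at planning time); probes then cost O(log n).
        sorted = literals.clone();
        Arrays.sort(sorted);
    }

    public boolean contains(long v) {
        // binarySearch returns a non-negative index iff the key is present.
        return Arrays.binarySearch(sorted, v) >= 0;
    }
}
```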
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 @gatorsmile @cloud-fan @dongjoon-hyun @viirya
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23171 **[Test build #99393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)** for PR 23171 at commit [`1477f10`](https://github.com/apache/spark/commit/1477f101951ed39b96f884ddd2d451e164c0cc43).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5467/