[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-04 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/23171
  
As @rxin said, if we introduce a separate expression for the switch-based 
approach, then we will need to modify other places, for example 
`DataSourceStrategy$translateFilter`. So integrating into `In` or `InSet` 
seems safer.

I think we can move the switch-based logic to `InSet` and make `InSet` 
responsible for picking the optimal execution path. We might need to adjust 
the condition under which we convert `In` to `InSet`, as the right threshold 
will most likely depend on the underlying data type. I saw noticeable 
improvements starting from 5 elements when comparing the if-else approach to 
the switch-based one. Right now, the conversion happens only for more than 10 
elements.
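
The trade-off being discussed can be illustrated with a hand-written sketch. The class and method names below are hypothetical, not Spark's actual generated code; the `switch` stands in for what codegen could emit for `value IN (1, 5, 9)` on an int column, while the hash-set variant pays an autoboxing cost on every probe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class InSketch {
    // Switch-based membership test, in the shape codegen might emit
    // for `value IN (1, 5, 9)` on an int column: no allocation, no boxing.
    static boolean inSwitch(int value) {
        switch (value) {
            case 1:
            case 5:
            case 9:
                return true;
            default:
                return false;
        }
    }

    // Hash-set based test: each probe boxes the primitive int to Integer.
    static final Set<Integer> SET = new HashSet<>(Arrays.asList(1, 5, 9));

    static boolean inSet(int value) {
        return SET.contains(value); // autoboxing happens here
    }
}
```

For small literal lists the `switch` compiles to a `tableswitch`/`lookupswitch` bytecode, which is why it can beat the hash set even though both are O(1).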


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
Basically, there are logically only two expressions: `In`, which handles 
arbitrary expressions, and `InSet`, which handles lists of literals. Both 
options could work: (1) we provide two separate expressions behind `InSet`, 
one using switch and one using a hash set, or (2) we provide a single `InSet` 
that internally has two implementations.

The downside of creating different physical expressions for the same logical 
expression is that downstream optimization rules would potentially need to 
match more cases.






[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23171
  
@rxin `switch` in Java is still significantly faster than a hash set, even 
without boxing/unboxing problems, when the number of elements is small. We 
were considering having two implementations in `InSet`, using `switch` when 
the number of elements is small and the hash set otherwise. But that is the 
same complexity as having two implementations in `In`, as in this PR.

@cloud-fan are you suggesting we create an `OptimizeIn` that has both `switch` 
and hash set implementations, chooses between them based on the number of 
elements, and removes `InSet`? That is basically what we were thinking above.







[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23171
  
How about we create an `OptimizedIn` and convert `In` to it when the list 
consists entirely of literals? `OptimizedIn` would pick `switch` or a hash set 
based on the length of the list.
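
A sketch of that dispatch idea follows. The class name, interface, and threshold value are all illustrative, not Spark's actual implementation; the linear scan stands in for the codegen `switch`, since a real `switch` requires literals known at compile time.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an OptimizedIn that chooses its strategy once,
// at build time, based on the number of literal values.
public class OptimizedInSketch {
    private static final int SWITCH_THRESHOLD = 10; // illustrative value

    interface Membership { boolean contains(int v); }

    static Membership build(int[] literals) {
        if (literals.length <= SWITCH_THRESHOLD) {
            // Small list: a linear scan over primitives stands in for
            // the switch-based generated code (no boxing per probe).
            final int[] vals = literals.clone();
            return v -> {
                for (int x : vals) if (x == v) return true;
                return false;
            };
        }
        // Large list: hash set, paying the boxing cost once at build time
        // but keeping O(1) probes for long lists.
        final Set<Integer> set = new HashSet<>();
        for (int x : literals) set.add(x);
        return set::contains;
    }
}
```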





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
I thought `InSwitch` is logically the same as `InSet`, in which all the child 
expressions are literals?






[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23171
  
I think `InSet` is not an optimized version of `In`, but just a way to 
separate the implementations for different conditions (the length of the 
list). Maybe we should do the same thing here: create an `InSwitch` and 
convert `In` to it when certain conditions are met. One problem is that `In` 
and `InSwitch` are identical in the interpreted path, so maybe we should 
create a base class for them.
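
The shared-base idea could look roughly like this. The names and structure are hypothetical (Spark's expression tree is Scala; this is a simplified Java rendering): both variants share one interpreted evaluation, and only the generated code differs.

```java
// Illustrative sketch of the base-class idea: In and InSwitch would share
// the interpreted evaluation and differ only in the code they generate.
// Names are hypothetical, not Spark's actual class hierarchy.
abstract class InBase {
    protected final int[] list; // literal values, for this sketch

    protected InBase(int[] list) { this.list = list; }

    // Shared interpreted path: identical for In and InSwitch.
    public final boolean eval(int value) {
        for (int x : list) if (x == value) return true;
        return false;
    }

    // Codegen differs per subclass (if-else chain vs. switch).
    public abstract String genCode(String input);
}

class InSwitchSketch extends InBase {
    InSwitchSketch(int[] list) { super(list); }

    @Override
    public String genCode(String input) {
        StringBuilder sb = new StringBuilder("switch (" + input + ") {\n");
        for (int x : list) sb.append("  case ").append(x).append(":\n");
        sb.append("    return true;\n  default:\n    return false;\n}");
        return sb.toString();
    }
}
```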





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
That probably means we should just optimize `InSet` to have the switch 
version, though, rather than doing it in `In`?






[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23171
  
@rxin I proposed the same thing before, but one problem is that we only 
convert `In` to `InSet` when the length of the list reaches the threshold. If 
the `switch` way is faster than a hash set when the list is small, it still 
seems worthwhile to optimize `In` using `switch`.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
I'm not a big fan of making the physical implementation of an expression 
very different depending on the situation. Why can't we just make `InSet` 
efficient and convert these cases to that?






[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/23171
  
yes @aokolnychyi , I agree that the work can be done later (not in the 
scope of this PR). We can maybe just open a new JIRA about it so we won't 
forget.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/23171
  
To sum up, I would say the goal of this PR is to make `In` expressions as 
efficient as possible for bytes/shorts/ints. Then we can benchmark `In` vs 
`InSet` in [SPARK-26203](https://issues.apache.org/jira/browse/SPARK-26203) 
and try to come up with a solution for `InSet` in 
[SPARK-26204](https://issues.apache.org/jira/browse/SPARK-26204). By the time 
we solve [SPARK-26204](https://issues.apache.org/jira/browse/SPARK-26204), we 
will have a clear understanding of the pros and cons of `In` and `InSet` and 
will be able to determine the right thresholds.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/23171
  
@dbtsai @mgaido91 I think we can come back to this question once 
[SPARK-26203](https://issues.apache.org/jira/browse/SPARK-26203) is resolved. 
That JIRA will give us enough information about each data type.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/23171
  
@dbtsai I see. It would be great, though, to check what this threshold is. 
My understanding is that the current solution has better performance even for 
several hundreds of items. If this number is in the thousands, and since it 
depends on the data type (so it is hard for users to control with a single 
config), it is arguable which solution is best: I don't think it is very 
common to have thousands of elements, while for lower (more common) numbers 
we would be using the less efficient solution.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23171
  
@cloud-fan as @aokolnychyi said, `switch` will still be faster than an 
optimized `Set` without autoboxing when the number of elements is small. As a 
result, this PR is still very useful.

@mgaido91 `InSet` can be better when implemented properly without autoboxing 
for large numbers of elements, controlled by 
`spark.sql.optimizer.inSetConversionThreshold`. Also, generating `In` with 
huge lists can cause a compile exception due to the JVM method size limit, as 
you pointed out. As a result, we should convert to `InSet` for large sets.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/23171
  
@cloud-fan, yeah, let's see if this PR is useful.

The original idea wasn't to avoid fixing autoboxing in `InSet`. `In` was 
tested on 250 numbers to demonstrate O(1) time complexity on compact values 
and to outline problems with `InSet`. After this change, `In` will be faster 
than `InSet`, but that is not the goal. Overall, the intent was a tiny, 
straightforward change that optimizes `In` expressions even when the number 
of elements is below `spark.sql.optimizer.inSetConversionThreshold` and Spark 
does not use `InSet`.

Once we solve the autoboxing issues in `InSet`, we would need to benchmark 
against this approach.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23171
  
I'm wondering if this is still useful after we fix the boxing issue in 
`InSet`. We can write a binary hash set for primitive types, like 
`LongToUnsafeRowMap`, which should have better performance.
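
A minimal open-addressing set over primitive `long`s, in the spirit of what is being proposed here: no boxing on insert or probe. This is an illustrative sketch only, not `LongToUnsafeRowMap` or any actual Spark class; capacity is a power of two sized for a load factor of at most 0.5, so linear probing always terminates.

```java
// Illustrative open-addressing hash set for primitive longs.
public class LongHashSet {
    private final long[] slots;
    private final boolean[] used;

    public LongHashSet(long[] values) {
        // Power-of-two capacity, at least twice the number of values.
        int cap = Integer.highestOneBit(Math.max(4, values.length * 2) - 1) << 1;
        slots = new long[cap];
        used = new boolean[cap];
        for (long v : values) insert(v);
    }

    private void insert(long v) {
        int i = hash(v);
        while (used[i]) {
            if (slots[i] == v) return;        // already present
            i = (i + 1) & (slots.length - 1); // linear probe, wraps around
        }
        used[i] = true;
        slots[i] = v;
    }

    public boolean contains(long v) {
        int i = hash(v);
        while (used[i]) {
            if (slots[i] == v) return true;
            i = (i + 1) & (slots.length - 1);
        }
        return false; // hit an empty slot: v was never inserted
    }

    private int hash(long v) {
        // Fibonacci-style mixing, then mask down to the table size.
        long h = v * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 40) & (slots.length - 1);
    }
}
```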





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/23171
  
Also cc @ueshin 





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23171
  
Merged build finished. Test PASSed.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23171
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99393/
Test PASSed.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23171
  
**[Test build #99393 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)**
 for PR 23171 at commit 
[`1477f10`](https://github.com/apache/spark/commit/1477f101951ed39b96f884ddd2d451e164c0cc43).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23171
  
The approach looks great and can significantly improve performance. For 
`Long`, I agree that we should also implement a binary search approach for 
`O(log n)` lookup.

I wonder which one will be faster: binary search over arrays, or rewriting 
the `if-else` chain in binary-search form.
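
The two variants can be sketched as follows for a fixed, sorted literal list (names and literals are illustrative): one delegates to `Arrays.binarySearch`, the other unrolls the comparison tree into the kind of branching a code generator could emit when the literals are known.

```java
import java.util.Arrays;

public class BinarySearchSketch {
    // Array-based variant: one Arrays.binarySearch call over sorted literals.
    static final long[] SORTED = {2L, 7L, 11L, 42L};

    static boolean containsArray(long v) {
        return Arrays.binarySearch(SORTED, v) >= 0;
    }

    // If-else unrolled into binary-search form for the same four literals:
    // the first comparison halves the list, then each half is checked directly.
    static boolean containsUnrolled(long v) {
        if (v < 11L) {
            return v == 2L || v == 7L;
        } else {
            return v == 11L || v == 42L;
        }
    }
}
```

The unrolled form avoids the array indirection entirely, at the cost of generated-code size growing with the literal count.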





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/23171
  
@gatorsmile @cloud-fan @dongjoon-hyun @viirya 





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23171
  
**[Test build #99393 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)**
 for PR 23171 at commit 
[`1477f10`](https://github.com/apache/spark/commit/1477f101951ed39b96f884ddd2d451e164c0cc43).





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23171
  
Merged build finished. Test PASSed.





[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23171
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5467/
Test PASSed.

