Re: [PR] Add short circuit [datafusion]
alamb commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019123741
##
benchmarks/queries/clickbench/README.md:
##
@@ -120,13 +122,42 @@ LIMIT 10;
```
Results look like
-
+```
+-+-+---+--+--+--+
| ClientIP| WatchID | c | tmin | tp95 | tmax |
+-+-+---+--+--+--+
| 1611957945 | 6655575552203051303 | 2 | 0| 0| 0|
| -1402644643 | 8566928176839891583 | 2 | 0| 0| 0|
+-+-+---+--+--+--+
+```
+
+### Q6: How many social shares meet complex multi-stage filtering criteria?
+**Question**: What is the count of sharing actions from iPhone mobile users on
specific social networks, within common timezones, participating in seasonal
campaigns, with high screen resolutions and closely matched UTM parameters?
+**Important Query Properties**: Simple filter with high-selectivity, Costly
string matching, A large number of filters with high overhead are positioned
relatively later in the process
Review Comment:
š
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
fn evaluate(&self, batch: &RecordBatch) -> Result {
use arrow::compute::kernels::numeric::*;
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();
+}
+}
+ColumnarValue::Scalar(scalar) => {
+if let ScalarValue::Boolean(Some(value)) = scalar {
+return !value;
+}
+}
+}
+false
+}
+(DataType::Boolean, Operator::Or) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.true_count() == array.len();
+}
+}
+ColumnarValue::Scalar(scalar) => {
+if let ScalarValue::Boolean(Some(value)) = scalar {
+return *value;
+}
+}
+}
+false
+}
+_ => false,
+}
+}
Review Comment:
Could you please also add some basic sqllogictests to ensure these code
paths are hit? Or show that they are covered by existing tests?
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
fn evaluate(&self, batch: &RecordBatch) -> Result {
use arrow::compute::kernels::numeric::*;
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
Review Comment:
If we find that this slows down some other performance we could also add
some sort of heuristic check to calling `false_count` / `true_count` -- like
for example if the rhs arg is "complex" (not a Column for example)
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
fn evaluate(&self, batch: &RecordBatch) -> Result {
use arrow::compute::kernels::numeric::*;
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
Review Comment:
Thanks @acking-you -- this looks great
Is there any reason to have this function defined in the `evaluate` method?
I think we could just make it a normal function and reduce the nesting level
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762023923 Hello @alamb, the optimization SQL and documentation related to this PR have been completed, and all tests have passed. We may need to formally verify the performance, but I'm not quite sure how to do that (I can only run it locally). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762014991 > You're absolutely right, I got my logic wrong there. Embarrasing! It's okay. You've also taught me a lot. When I first started writing this, I really didn't consider the case of null -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762008525 You're absolutely right, I got my logic wrong there. Embarrasing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761813612 > I think one issue is that the short-circuit logic is not handling cases where the the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` After taking a closer look, in fact, the situation you mentioned does not actually lead to the short-circuit optimization logic. The error Iām currently seeing is the use of `true_count() == 0` to determine if it is false, but in reality, it could also be null. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761137454 > I think one issue is that the short-circuit logic is not handling cases where the the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` Thank you very much for your hint, it will be very helpful for me to fix these tests! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2760574284 I think one issue is that the short-circuit logic is not handling cases where the the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2759162126 Some tests failed, so let me take a look at what exactly is going on. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
