[GitHub] spark issue #14912: [SPARK-17357][SQL] Fix current predicate pushdown
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 @hvanhovell OK. Let's see if we can have a proper CNF soon. Thank you. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14912 @viirya TBH this seems hacky to me and I'd rather not merge this. I think we should just focus on having proper CNF in the optimizer. I am sorry to disappoint you.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 ping @cloud-fan @hvanhovell @srinathshankar Can you take a look?
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 ping @cloud-fan @hvanhovell @srinathshankar again, please take a look if you have time. Thanks!
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 ping @cloud-fan @hvanhovell Can you review this if you have time? Thanks!
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 ping @cloud-fan @hvanhovell @srinathshankar again, would you please take a look at this? Thanks.
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14912 @viirya, I agree that we need a separate set of PRs to address the general problem. On your comment: "I think the goal to simplify a predicate such as (a > 10 || b > 2) && (a > 10 || c == 3) to (a > 10) || (b > 2 && c == 3), is to eliminate redundant filtering expressions running in Filter in execution time." My two cents: If that is the case, deferring the simplification to the point just before execution time would be an option to consider.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14912 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14912 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65298/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14912 **[Test build #65298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65298/consoleFull)** for PR 14912 at commit [`f69473f`](https://github.com/apache/spark/commit/f69473fd3f09f8b11fe63eff07ab72dfce9fee96). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14912 **[Test build #65298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65298/consoleFull)** for PR 14912 at commit [`f69473f`](https://github.com/apache/spark/commit/f69473fd3f09f8b11fe63eff07ab72dfce9fee96).
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 ping @srinathshankar @cloud-fan @hvanhovell Can you help review this change? Some context here: some predicates cannot be pushed down for two reasons: 1. The predicates are simplified into a form that cannot be pushed down. A `Filter` can't be pushed down through another `Filter`, so in the next optimization round the predicates in the upper `Filter` are simplified into a form that can no longer be pushed down. This is what this change fixes: it triggers `PushDownPredicate` in `CombineFilters`, so the combined predicates can be pushed down before the `BooleanSimplification` rule runs. 2. The predicates are in a form that cannot be pushed down from the beginning. We may need to come up with an approach that maintains multiple forms of a predicate, which could at least benefit pushdown and expression simplification. That is left to later PRs and needs more discussion.
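[Editor's note] A minimal sketch of the pushdown step discussed above, not Catalyst code: after adjacent Filters are merged, the merged condition can be split into conjuncts, and each conjunct that references only the child operator's output attributes can be pushed down. `Conjunct`, `partition`, and the attribute sets below are illustrative assumptions.

```scala
// Toy model: a conjunct is its SQL text plus the set of attributes it references.
object PushdownSketch {
  case class Conjunct(sql: String, refs: Set[String])

  // Split conjuncts into (pushable, kept) with respect to the child's output:
  // a conjunct is pushable when all attributes it references come from the child.
  def partition(conjuncts: Seq[Conjunct],
                childOutput: Set[String]): (Seq[Conjunct], Seq[Conjunct]) =
    conjuncts.partition(_.refs.subsetOf(childOutput))
}
```

In Catalyst the same idea is expressed over `Expression` trees using attribute references; this toy model only shows the partitioning decision, not the tree rewriting.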
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 @nsyca Thanks for your detailed comment. I would like to leave the decision on predicate transformation to later PRs, as this PR is not motivated by it. I think the goal of simplifying a predicate such as `(a > 10 || b > 2) && (a > 10 || c == 3)` to `(a > 10) || (b > 2 && c == 3)` is to eliminate redundant filtering expressions that would run in `Filter` at execution time. As I said in a previous comment, my first preference is not to complicate the predicate handling too much. We can keep a form of the predicate that benefits predicate pushdown most (I guess that form should be CNF), and also keep the simplified form of the predicate, which is better for execution in `Filter`.
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14912 Thanks, @gatorsmile, for mentioning me. I will try my best to comment on this thread. Disclaimer: I have not looked at the existing code manipulating predicates/expressions in Spark, nor have I looked at the code in this PR. I am writing my comment here based solely on the comments I read in this PR. One of the goals of predicate transformation, in general, is to aid predicate pushdown. If a new form of a predicate, or a derived superset form of a predicate, is to be generated, it should be because there is a potential that the new or derived form can be pushed down further in the plan. Another goal of the transformation is that the new form has a potential to be simplified further. Taking the example of ``(a > 10 || b > 2) && (a > 10 || c == 3)``, I don't see any benefit in transforming it to ``(a > 10) || (b > 2 && c == 3)``, as it forms a disjunctive predicate. Only if ``b == c`` held by the transitivity rule might we want to do that, in order to simplify further to ``(a > 10 || c == 3)`` (because given ``b == c``, ``c > 2 && c == 3`` can be reduced to ``c == 3``). The biggest benefit in predicate transformation comes from the equality transitivity property, as equality predicates are commonly used in SQL queries. I remember there were a few JIRAs opened, but deferred, to solve this problem. There is some capability in the current version to propagate equality transitivity, but the behaviour is not consistent. Another predicate transformation is extracting common subterms. An example: the predicate ``(a=1 || b=2) && (a=1 || c=4)``, where a is a column from a different stream than columns b and c, should be transformed to ``a=1 || (b=2 && c=4)``. A more complex case: the predicate ``(a=1 || b=2) && (a=3 || c=4)`` should have a new predicate ``(a=1 || a=3)`` added as a superset predicate to early-filter the stream of a down to just the two values needed.
Introducing superset, redundant predicates like the last example above will complicate the computation of filter ratios of the predicates on a given stream when we introduce Cost-Based Optimization, which I assume depends on a good estimate of the filter ratios on a given stream. This is because we cannot assume independent filtering effects among a set of predicates; the filter ratio of the newly generated superset predicate should therefore be ignored in the filtering estimate. Another goal of predicate transformation is to derive a contradiction and/or tautology. This is achieved by building the inequality relationships among predicates on the same column. A simple example: ``a > 1 && a < 1`` should be evaluated to ``false`` at compile time, eliminating the scan of the stream completely. The stream is then treated as producing an empty set. Depending on the context, the stream may be substituted by a NULL row when it is a subquery in an existential (EXISTS) or a universal (ALL) subquery, or by a singleton NULL value when it is a scalar subquery.
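[Editor's note] The compile-time contradiction check described above can be sketched as follows. This is a toy model under stated assumptions: a single integer-typed column, only strict ``<``/``>`` bounds, and hypothetical names (`Bound`, `contradicts`); a real optimizer handles many more predicate shapes.

```scala
// Toy model: collect simple bounds on one integer column and detect
// an empty value range, so `a > 1 && a < 1` folds to false.
object ContradictionSketch {
  sealed trait Bound
  case class Gt(v: Int) extends Bound // column > v
  case class Lt(v: Int) extends Bound // column < v

  // The conjunction is unsatisfiable when the tightest lower bound,
  // bumped to the next integer, meets or exceeds the tightest upper bound.
  def contradicts(bounds: Seq[Bound]): Boolean = {
    val lows  = bounds.collect { case Gt(v) => v }
    val highs = bounds.collect { case Lt(v) => v }
    lows.nonEmpty && highs.nonEmpty && lows.max + 1 >= highs.min
  }
}
```

With this check, ``a > 1 && a < 1`` is detected as a contradiction, while ``a > 1 && a < 3`` is satisfiable (a = 2).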
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 Maintaining such predicate sets may add a lot of complexity, as far as I can tell. I don't know how big a set could get, but once you change one of the predicates, you need to reconstruct all the equivalent predicates in the set too. I think we can maintain the CNF and the simplified predicates: CNF should be enough to push down predicates, and the simplified predicate can be used in `Filter` execution.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14912 I am wondering whether it makes more sense to maintain multiple semantically equivalent predicate sets for each `Filter`. In your example, we have both `(a > 10 || b > 2) && (a > 10 || c == 3)` and `(a > 10) || (b > 2 && c == 3)`. If we also consider predicate transitivity inference and predicate simplification at the same time, we could have multiple semantically equivalent predicate sets. Then we have more chances to push down the predicates.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 also cc @cloud-fan
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 @gatorsmile I've described it in the PR description. Simply put, a Filter currently stops being pushed down once it encounters another Filter, and the `BooleanSimplification` rule then simplifies the predicate into a form that can't be pushed down in the next round of optimization. For example, (a > 10 || b > 2) && (a > 10 || c == 3) will be simplified to (a > 10) || (b > 2 && c == 3). What this patch does is perform `PushDownPredicate` once the adjoining Filters are merged.
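[Editor's note] The simplification discussed here, `(a > 10 || b > 2) && (a > 10 || c == 3)` becoming `(a > 10) || (b > 2 && c == 3)`, factors out the disjunct common to every conjunct. A toy sketch of that factoring step, with atoms modelled as strings rather than Catalyst expressions (`FactorSketch` and `factor` are illustrative names, not the real rule's code):

```scala
// Toy model: a predicate in CNF is a Seq of clauses, each clause a Set of atoms.
// (A || B) && (A || C) factors to A || (B && C) when A appears in every clause.
object FactorSketch {
  // Returns (common disjuncts, remaining clauses with the common part removed).
  def factor(conj: Seq[Set[String]]): (Set[String], Seq[Set[String]]) = {
    val common = conj.reduce(_ intersect _)
    (common, conj.map(_ -- common).filter(_.nonEmpty))
  }
}
```

For the example above, the common part is `a > 10` and the remainder is the conjunction of `b > 2` and `c == 3`, i.e. the factored disjunctive form that blocks further pushdown.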
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14912 Could you define the conditions in which the predicates are unable to be pushed down? Then, we can easily justify the significance.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 @srinathshankar @gatorsmile I think CNF is a separate issue from the one this PR was originally proposed to solve. I would like to fix the original adjoining-Filter pushdown problem here and leave the CNF issue (it is not trivial, and I don't expect it to be solved soon) to later PRs. What do you think? Thanks.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 The CNF exponential expansion issue was an important concern in previous works. Actually, you can see that this patch doesn't produce a full CNF for a predicate. I use `splitDisjunctivePredicates` to obtain the disjunctive predicates and convert them to conjunctive form. The conversion here is not recursive, which I think should prevent exponential explosion. Of course it is a compromise and can't benefit all predicates, but I doubt complex predicates that would need a complete CNF conversion are used very often.
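[Editor's note] The non-recursive conversion described here can be sketched as a single distribution step: split the top-level disjuncts and take their cross product, turning `A || (B && C)` into `(A || B) && (A || C)` without recursing into the resulting clauses. This is a toy model; `distributeOnce` is a hypothetical name, not the patch's actual code.

```scala
// Toy model: each top-level disjunct is a conjunction of atom strings;
// the result is a conjunction of disjunctive clauses (one level of CNF).
object DistributeSketch {
  def distributeOnce(disjuncts: Seq[Seq[String]]): Seq[Seq[String]] =
    disjuncts.foldLeft(Seq(Seq.empty[String])) { (acc, d) =>
      // Extend every clause built so far with each atom of the next disjunct.
      for (clause <- acc; atom <- d) yield clause :+ atom
    }
}
```

Note that even one level of distribution multiplies clause counts (a product over the disjuncts' sizes), which is presumably why the patch compromises rather than converting fully and recursively.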
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 hmm, it looks like there were previous works regarding CNF, but none of them were actually merged. @gatorsmile Thanks for the context.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14912 @viirya Could you please wait for the CNF predicate normalization rule? @liancheng @yjshen did some related work before. See https://github.com/apache/spark/pull/10444 and https://github.com/apache/spark/pull/8200. Let us also collect input from @ioana-delaney @nsyca; they did a lot of related work in the past 10+ years. We need a good design for CNF normalization, which can benefit the other optimizer rules.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14912 @srinathshankar I've addressed your comments. Please take a look. Thanks.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14912 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14912 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64929/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14912 **[Test build #64929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64929/consoleFull)** for PR 14912 at commit [`8f6f91d`](https://github.com/apache/spark/commit/8f6f91df7fe1d02a69215aea6ca9ae0b37416747). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.