[GitHub] [spark] gengliangwang commented on a diff in pull request #38511: [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters
gengliangwang commented on code in PR #38511:
URL: https://github.com/apache/spark/pull/38511#discussion_r1023241331

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:

@@ -29,26 +29,13 @@
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Relation, DataSourceV2ScanRelation}
 import org.apache.spark.sql.internal.SQLConf

-/**
- * A pattern that matches any number of project or filter operations even if they are
- * non-deterministic, as long as they satisfy the requirement of CollapseProject and CombineFilters.
- * All filter operators are collected and their conditions are broken up and returned
- * together with the top project operator. [[Alias Aliases]] are in-lined/substituted if
- * necessary.
- */
-object PhysicalOperation extends AliasHelper with PredicateHelper {
+trait OperationHelper extends AliasHelper with PredicateHelper {
   import org.apache.spark.sql.catalyst.optimizer.CollapseProject.canCollapseExpressions

-  type ReturnType =
-    (Seq[NamedExpression], Seq[Expression], LogicalPlan)
   type IntermediateType =
-    (Option[Seq[NamedExpression]], Seq[Expression], LogicalPlan, AttributeMap[Alias])
+    (Option[Seq[NamedExpression]], Seq[Seq[Expression]], LogicalPlan, AttributeMap[Alias])

-  def unapply(plan: LogicalPlan): Option[ReturnType] = {
-    val alwaysInline = SQLConf.get.getConf(SQLConf.COLLAPSE_PROJECT_ALWAYS_INLINE)
-    val (fields, filters, child, _) = collectProjectsAndFilters(plan, alwaysInline)
-    Some((fields.getOrElse(child.output), filters, child))
-  }
+  protected def canKeepMultipleFilters: Boolean

Review Comment:
Nit: add a simple comment

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
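The diff above changes the collected filter conditions from one flat `Seq[Expression]` to `Seq[Seq[Expression]]`: one group per `Filter` operator, so adjacent nondeterministic Filters keep their boundaries instead of being merged. A minimal Python sketch of that grouping (the tuple-based plan encoding and the function name are hypothetical, not Spark's `LogicalPlan` API):

```python
def collect_filters(plan):
    """Collect filter condition groups top-down from a toy nested plan.

    plan is a hypothetical encoding: ('filter', conditions, child) nodes
    over a ('scan',) leaf.
    """
    groups = []
    while plan[0] == 'filter':
        _, conditions, plan = plan
        groups.append(list(conditions))  # one group per Filter operator
    return groups, plan

# Two adjacent Filters: a nondeterministic one on top of a deterministic one.
plan = ('filter', ['rand() < 0.5'],
        ('filter', ['a > 1', 'b IS NOT NULL'],
         ('scan',)))
groups, leaf = collect_filters(plan)
# groups preserves the Filter boundaries instead of flattening the conditions.
```

With the old flat `Seq[Expression]`, the two Filters above would have been indistinguishable from a single merged Filter, which is exactly what must be avoided when one of them is nondeterministic.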
gengliangwang commented on code in PR #38511:
URL: https://github.com/apache/spark/pull/38511#discussion_r1023241088

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:

@@ -146,8 +146,12 @@
   def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case PhysicalOperation(projects, filters,
+    case ScanOperation(projects, allFilters,
       l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
+      // We can only push down the bottom-most filter to the relation, as `ScanOperation` decided to
+      // not merge these filters and we need to keep their evaluation order.
+      val filters = allFilters.lastOption.getOrElse(Nil)

Review Comment:
So for filter pushdown, we will use the last filter. For schema pruning, we will use all the filters. I wonder if we should return both `allFilters` and `pushdownFilters` to make the semantics clear.
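The comment in the diff hinges on evaluation order: filter groups must run bottom-up, and an upper nondeterministic predicate must only see rows that survived the lower group, so only the bottom-most group is safe to push into the scan. A toy Python model of this (function and variable names are hypothetical, not Spark's API):

```python
import random

def apply_filter_groups(rows, filter_groups):
    """Evaluate filter groups bottom-up (the last group is closest to the scan).

    Groups are not merged or reordered: an upper nondeterministic predicate
    must only see rows that survived the lower group.
    """
    for group in reversed(filter_groups):
        for predicate in group:
            rows = [r for r in rows if predicate(r)]
    return rows

rng = random.Random(0)
# Top-down order, as the matcher would collect them (toy predicates):
filter_groups = [
    [lambda r: rng.random() < 0.5],  # upper, nondeterministic filter
    [lambda r: r % 2 == 0],          # bottom-most, deterministic filter
]
rows = list(range(10))
result = apply_filter_groups(rows, filter_groups)
# Only the bottom-most group is equivalent to a pushed-down scan filter;
# every surviving row must have passed it first.
pushed = [r for r in rows if r % 2 == 0]
</imports>

Pushing the upper group down instead would change how many times the nondeterministic predicate is evaluated and against which rows, altering the query's semantics, which is why `allFilters.lastOption` is used for pushdown while all groups remain available for schema pruning.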
gengliangwang commented on code in PR #38511:
URL: https://github.com/apache/spark/pull/38511#discussion_r1023204106

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:

@@ -85,15 +72,25 @@
       // projects. We need to meet the following conditions to do so:
       // 1) no Project collected so far or the collected Projects are all deterministic
       // 2) the collected filters and this filter are all deterministic, or this is the
-      //    first collected filter.
+      //    first collected filter. This condition can be relaxed if `canKeepMultipleFilters` is

Review Comment:
TBH, the comment here is hard to understand.
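One way to read the condition the reviewer finds confusing: a new Filter may be collected if everything gathered so far is deterministic, or if it is the very first Filter, or (the relaxation) if the matcher opts in to keeping multiple filter groups. A Python restatement under those assumptions (all names are hypothetical; `is_deterministic` is a toy stand-in for Catalyst's `Expression.deterministic`):

```python
def is_deterministic(cond):
    # Toy stand-in for Expression.deterministic on a string condition.
    return 'rand' not in cond

def can_collect_filter(collected_groups, new_conditions, can_keep_multiple_filters):
    """Sketch of condition 2: accept the new Filter if all conditions so far
    and the new ones are deterministic, or it is the first Filter collected,
    or the matcher keeps multiple filter groups (as ScanOperation would)."""
    all_det = (all(is_deterministic(c) for g in collected_groups for c in g)
               and all(is_deterministic(c) for c in new_conditions))
    first_filter = not collected_groups
    return all_det or first_filter or can_keep_multiple_filters
```

Under this reading, a second Filter above a nondeterministic one is rejected by the strict rule but accepted once `canKeepMultipleFilters` is true, since the groups are kept separate and their evaluation order is preserved.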