[GitHub] [spark] gengliangwang commented on a diff in pull request #38511: [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters

2022-11-15 Thread GitBox


gengliangwang commented on code in PR #38511:
URL: https://github.com/apache/spark/pull/38511#discussion_r1023241331


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##
@@ -29,26 +29,13 @@ import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Relation, DataSourceV2ScanRelation}
 import org.apache.spark.sql.internal.SQLConf
 
-/**
- * A pattern that matches any number of project or filter operations even if they are
- * non-deterministic, as long as they satisfy the requirement of CollapseProject and CombineFilters.
- * All filter operators are collected and their conditions are broken up and returned
- * together with the top project operator. [[Alias Aliases]] are in-lined/substituted if
- * necessary.
- */
-object PhysicalOperation extends AliasHelper with PredicateHelper {
+trait OperationHelper extends AliasHelper with PredicateHelper {
   import org.apache.spark.sql.catalyst.optimizer.CollapseProject.canCollapseExpressions
 
-  type ReturnType =
-    (Seq[NamedExpression], Seq[Expression], LogicalPlan)
   type IntermediateType =
-    (Option[Seq[NamedExpression]], Seq[Expression], LogicalPlan, AttributeMap[Alias])
+    (Option[Seq[NamedExpression]], Seq[Seq[Expression]], LogicalPlan, AttributeMap[Alias])
 
-  def unapply(plan: LogicalPlan): Option[ReturnType] = {
-    val alwaysInline = SQLConf.get.getConf(SQLConf.COLLAPSE_PROJECT_ALWAYS_INLINE)
-    val (fields, filters, child, _) = collectProjectsAndFilters(plan, alwaysInline)
-    Some((fields.getOrElse(child.output), filters, child))
-  }
+  protected def canKeepMultipleFilters: Boolean

Review Comment:
   Nit: add a simple comment
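
   For illustration, a hedged sketch of the kind of Scaladoc this abstract member could carry (the wording is illustrative, not taken from the PR):

    import org.apache.spark.sql.catalyst.expressions.{AliasHelper, PredicateHelper}

    trait OperationHelper extends AliasHelper with PredicateHelper {
      /**
       * Whether adjacent Filter operators may be kept as separate condition
       * groups rather than combined into a single Filter. Keeping them
       * separate preserves the evaluation order of nondeterministic
       * predicates, at the cost of only being able to push the bottom-most
       * group into the scan.
       */
      protected def canKeepMultipleFilters: Boolean
    }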





[GitHub] [spark] gengliangwang commented on a diff in pull request #38511: [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters

2022-11-15 Thread GitBox


gengliangwang commented on code in PR #38511:
URL: https://github.com/apache/spark/pull/38511#discussion_r1023241088


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -146,8 +146,12 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
   }
 
   def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case PhysicalOperation(projects, filters,
+    case ScanOperation(projects, allFilters,
       l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
+      // We can only push down the bottom-most filter to the relation, as `ScanOperation`
+      // decided to not merge these filters and we need to keep their evaluation order.
+      val filters = allFilters.lastOption.getOrElse(Nil)

Review Comment:
   So for filter pushdown, we will use only the last (bottom-most) filter group, while for schema pruning we will use all the filters.
   I wonder if we should return both `allFilters` and `pushdownFilters` to make the semantics clear.
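
   To make the distinction concrete, a minimal sketch with plain Scala collections (the string predicates and the name `pushdownFilters` are illustrative, not from the PR):

    // Each inner Seq holds one Filter's split conjunctive predicates, collected
    // top-down, so the bottom-most group (closest to the relation) comes last.
    val allFilters: Seq[Seq[String]] =
      Seq(Seq("rand() < 0.5"), Seq("a > 1", "b IS NOT NULL"))

    // Only the bottom-most group may be pushed into the scan: it runs before
    // any nondeterministic filter stacked above it.
    val pushdownFilters: Seq[String] = allFilters.lastOption.getOrElse(Nil)

    // Schema pruning can look at every group, since it only needs the set of
    // referenced columns, not the evaluation order.
    val referencedByFilters: Seq[String] = allFilters.flatten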




[GitHub] [spark] gengliangwang commented on a diff in pull request #38511: [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters

2022-11-15 Thread GitBox


gengliangwang commented on code in PR #38511:
URL: https://github.com/apache/spark/pull/38511#discussion_r1023204106


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##
@@ -85,15 +72,25 @@ object PhysicalOperation extends AliasHelper with PredicateHelper {
         // projects. We need to meet the following conditions to do so:
         //   1) no Project collected so far or the collected Projects are all deterministic
         //   2) the collected filters and this filter are all deterministic, or this is the
-        //      first collected filter.
+        //      first collected filter. This condition can be relaxed if `canKeepMultipleFilters` is
Review Comment:
   TBH, the comment here is hard to understand.

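   For reference, the situation that comment is trying to describe; a minimal sketch (the local SparkSession and the sampling fractions are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.rand

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.range(100).toDF("id")

    // Two stacked nondeterministic Filters. CombineFilters will not merge
    // them, since a single merged Filter would no longer guarantee that the
    // upper rand() runs only on rows that already passed the lower one. The
    // relaxed condition lets the pattern keep collecting such filters as
    // separate groups instead of stopping at the second one.
    val sampled = df.filter(rand(1) < 0.5).filter(rand(2) < 0.5)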

