[GitHub] [spark] sunchao commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-15 Thread GitBox


sunchao commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r972301689


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // Partition keys are not available in the statistics of the files.
   // `dataColumns` might have partition columns, we need to filter them 
out.
-  val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionColumns.contains)

Review Comment:
   I think semantically both are the same. In
   ```scala
   AttributeSet(dataColumns) -- partitionColumns
   ```
   `partitionColumns` is first wrapped into `AttributeSet` and then compared 
with `AttributeSet(dataColumns)`.
   
   Your version does require one less line of change though :)
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-15 Thread GitBox


sunchao commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r972297552


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // Partition keys are not available in the statistics of the files.
   // `dataColumns` might have partition columns, we need to filter them 
out.
-  val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionColumns.contains)
+  val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- 
partitionColumns

Review Comment:
   @sadikovi np. I explained a bit in the PR description. Let me add more 
details. 
   
   There are two steps when calculating data filters for V1 file source:
   
   1. compute `dataColumnsWithoutPartitionCols`
   2. call `extractPredicatesWithinOutputSet` using 
`dataColumnsWithoutPartitionCols ` on the filters, to obtain data only filters 
which are supposed to be pushed down to data sources
   
   In the first step, the equality check is done via 
[`AttributeReference.equals`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L275),
 which checks attribute name, among other things.
   
   In the second step, however, the equality is checked via 
[`AttributeEquals.equals`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeSet.scala#L29)
 which only checks expression ID.
   
   This inconsistency poses an issue when case sensitive check is false (which 
is the default behavior). For the example in the PR description:
   
   - data column: `[COL, a]`
   - partition column: `[col]`
   - filter: `col > 10`
   
   The expression ID for data column `COL` and partition column `col` are the 
same because of case insensitivity. In the first step above, however, `COL` and 
`col` are not considered equal and thus the `dataColumnsWithoutPartitionCols` 
will still be `[COL, a]`. In the second step, the data filters are calculated 
using `AttributeEquals.equals` and `COL` is treated as data column. As result, 
filter `col > 10` is considered as data filter and pushed down.
   
   In general, where data columns overlap with partition columns and case 
sensitivity is false, the first step will not filter out partition columns, so 
they will still be used in the second step. This is incorrect.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org