sunchao commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r972297552
##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with
PredicateHelper with Logging {
// Partition keys are not available in the statistics of the files.
// `dataColumns` might have partition columns, we need to filter them
out.
- val dataColumnsWithoutPartitionCols =
dataColumns.filterNot(partitionColumns.contains)
+ val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) --
partitionColumns
Review Comment:
@sadikovi np. I explained a bit in the PR description. Let me add more
details.
There are two steps when calculating data filters for V1 file source:
1. compute `dataColumnsWithoutPartitionCols`
2. call `extractPredicatesWithinOutputSet` using
`dataColumnsWithoutPartitionCols ` on the filters, to obtain data only filters
which are supposed to be pushed down to data sources
In the first step, the equality check is done via
[`AttributeReference.equals`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L275),
which checks attribute name, among other things.
In the second step, however, the equality is checked via
[`AttributeEquals.equals`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeSet.scala#L29)
which only checks expression ID.
This inconsistency poses an issue when case sensitive check is false (which
is the default behavior). For the example in the PR description:
- data column: `[COL, a]`
- partition column: `[col]`
- filter: `col > 10`
The expression ID for data column `COL` and partition column `col` are the
same because of case insensitivity. In the first step above, however, `COL` and
`col` are not considered equal and thus the `dataColumnsWithoutPartitionCols`
will still be `[COL, a]`. In the second step, the data filters are calculated
using `AttributeEquals.equals` and `COL` is treated as data column. As result,
filter `col > 10` is considered as data filter and pushed down.
In general, where data columns overlap with partition columns and case
sensitivity is false, the first step will not filter out partition columns, so
they will still be used in the second step. This is incorrect.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org