[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-576522610 @gengliangwang thanks for reviewing and merging! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-575870211 @gengliangwang thanks for reviewing. I agree with your concern, and also this can be improved in subsequent PRs which will require a broader change in the V2 DataSource API. I'll be glad to help with that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-575825450 @gengliangwang @cloud-fan can you please review this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-574533965 @gengliangwang see also this [PR](https://github.com/apache/spark/pull/17322) which originally added the `dataFilters` to the list files method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-573543733 @gengliangwang by `"data skipping uniformly for all file based data sources"` I mean that the above approach works uniformly for all formats whether they support pushdown or not. (It has also benefits for formats which support pushdown such as parquet by avoiding the need to read the footer of each file). See for example this [Spark Summit talk](https://databricks.com/session/using-pluggable-apache-spark-sql-filters-to-help-gridpocket-users-keep-up-with-the-jones-and-save-the-planet). Note that in datasource v1 the `dataFilters` are also passed to the `listFiles` method in the [`FileSourceScanExec`](https://github.com/apache/spark/blob/eefcc7d762a627bf19cab7041a1a82f88862e7e1/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210) case class which is used by all of the file based datasources. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-572947015 @gengliangwang I have fixed the tests and added also a test for Avro scan without `partitionFilters` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-572889083 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-572823086 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing URL: https://github.com/apache/spark/pull/27157#issuecomment-572819405 @gengliangwang as for tests I have added to the existing tests a check that the `dataFilters` are indeed passed to the `FileScan`. In addition I have added a test which doesn't have `partitionFilters` so only the `dataFilters` should be not empty. Since the current `FileIndex` is not affected by the `dataFilters` there is no test that checks any pruning besides the filtering that is done by the `partitionFilters` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org