[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-20 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-576522610
 
 
   @gengliangwang thanks for reviewing and merging!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-17 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-575870211
 
 
   @gengliangwang thanks for reviewing.
   I agree with your concern. This can be improved in subsequent PRs, which 
will require a broader change to the DataSource V2 API. I'll be glad to help 
with that.





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-17 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-575825450
 
 
   @gengliangwang @cloud-fan could you please review this PR?





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-14 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-574533965
 
 
   @gengliangwang see also this 
[PR](https://github.com/apache/spark/pull/17322), which originally added 
`dataFilters` to the `listFiles` method.





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-12 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-573543733
 
 
   @gengliangwang by `"data skipping uniformly for all file based data 
sources"` I mean that the above approach works uniformly for all formats, 
whether or not they support pushdown. 
   (It also benefits formats that do support pushdown, such as Parquet, by 
avoiding the need to read the footer of each file.)
   See for example this [Spark Summit 
talk](https://databricks.com/session/using-pluggable-apache-spark-sql-filters-to-help-gridpocket-users-keep-up-with-the-jones-and-save-the-planet).
   
   Note that in DataSource V1 the `dataFilters` are also passed to the 
`listFiles` method in the 
[`FileSourceScanExec`](https://github.com/apache/spark/blob/eefcc7d762a627bf19cab7041a1a82f88862e7e1/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210)
 case class, which is used by all of the file-based data sources.
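
As a rough illustration of the idea (a conceptual model only, not code from this PR or from Spark), the sketch below shows how a file index that receives data filters at listing time could skip whole files using per-file min/max metadata, regardless of the file format. `FileStats`, `may_match`, `list_files`, and the `(column, op, literal)` filter shape are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical data filter: (column, operator, integer literal).
DataFilter = Tuple[str, str, int]


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata kept outside the file itself."""
    path: str
    partition: Dict[str, str]            # partition column -> value
    min_max: Dict[str, Tuple[int, int]]  # data column -> (min, max)


def may_match(stats: FileStats, flt: DataFilter) -> bool:
    """Return False only when the metadata proves no row can match."""
    column, op, value = flt
    if column not in stats.min_max:
        return True  # no metadata for this column: the file cannot be skipped
    lo, hi = stats.min_max[column]
    if op == ">":
        return hi > value
    if op == "<":
        return lo < value
    if op == "=":
        return lo <= value <= hi
    return True  # unknown operator: be conservative and keep the file


def list_files(files: List[FileStats],
               partition_filters: Dict[str, str],
               data_filters: List[DataFilter]) -> List[str]:
    """Prune by partition values first, then by data filters."""
    selected = []
    for f in files:
        if any(f.partition.get(k) != v for k, v in partition_filters.items()):
            continue  # pruned by partition filters
        if all(may_match(f, flt) for flt in data_filters):
            selected.append(f.path)
    return selected


files = [
    FileStats("year=2019/a.csv", {"year": "2019"}, {"temp": (0, 10)}),
    FileStats("year=2020/b.csv", {"year": "2020"}, {"temp": (5, 30)}),
    FileStats("year=2020/c.csv", {"year": "2020"}, {"temp": (-5, 3)}),
]

# Only b.csv can contain rows with temp > 20 in the year=2020 partition.
print(list_files(files, {"year": "2020"}, [("temp", ">", 20)]))
```

Because the pruning decision uses metadata stored outside the files, the same logic applies to formats without pushdown support (such as CSV in this toy example), and no file footer has to be opened to make it.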
   





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-10 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-572947015
 
 
   @gengliangwang I have fixed the tests and also added a test for an Avro 
scan without `partitionFilters`.





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-09 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-572889083
 
 
   retest this please





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-09 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-572823086
 
 
   retest this please





[GitHub] [spark] guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push data filters for file listing

2020-01-09 Thread GitBox
guykhazma commented on issue #27157: [SPARK-30475][SQL] File source V2: Push 
data filters for file listing
URL: https://github.com/apache/spark/pull/27157#issuecomment-572819405
 
 
   @gengliangwang as for tests, I have added a check to the existing tests 
that the `dataFilters` are indeed passed to the `FileScan`.
   In addition, I have added a test that has no `partitionFilters`, so only 
the `dataFilters` should be non-empty.
   Since the current `FileIndex` is not affected by the `dataFilters`, there 
is no test that checks any pruning beyond the filtering done by the 
`partitionFilters`.
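
To make the shape of those checks concrete, here is a simplified model (an assumption for illustration, not the actual Spark test code) of how a query's predicates might be divided into partition filters and data filters, and what the "no `partitionFilters`" case asserts. `Predicate` and `split_filters` are hypothetical names, and the rule used here (a predicate is a partition filter when it references only partition columns) is a simplification of what Spark does.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Predicate(NamedTuple):
    """Hypothetical predicate: its SQL text and the columns it references."""
    sql: str
    references: FrozenSet[str]


def split_filters(filters: List[Predicate],
                  partition_columns: FrozenSet[str]
                  ) -> Tuple[List[Predicate], List[Predicate]]:
    """Split predicates into (partition_filters, data_filters)."""
    partition_filters = [f for f in filters
                         if f.references <= partition_columns]
    data_filters = [f for f in filters
                    if not f.references <= partition_columns]
    return partition_filters, data_filters


partition_columns = frozenset({"p"})
on_partition = Predicate("p = 1", frozenset({"p"}))
on_data = Predicate("value > 2", frozenset({"value"}))

# Mixed query: one filter of each kind reaches the scan.
both = split_filters([on_partition, on_data], partition_columns)

# Query without partition filters: only the data filters are non-empty.
data_only = split_filters([on_data], partition_columns)
```

In this model, the second case corresponds to the added test: the partition filter list is empty while the data filter list still carries the predicate on `value`.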

