bart-samwel commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-644719642
The option name `fileModifiedDate` doesn't indicate at all that it's a *minimum* modified date. I can imagine use cases for lower bounds, upper bounds, and ranges, which would require at least two options, e.g. `filesModifiedAfter` and `filesModifiedBefore`.
There's also the `pathGlobFilter` option, which only supports globs, but there as well other use cases are conceivable, e.g. "files with paths lexicographically greater than a given file name", or "files with names that, after parsing, satisfy some interesting condition".
It seems to me that this is asking for more generic filtering functionality, e.g. something like `.fileFilter(lambda)`, where the lambda receives an object that carries not only the path but also attributes like the modification date. That said, specific options may be pushed down into the data source (e.g. S3 supports prefix filters and `start-after`), so it would make sense to keep things as options wherever pushdown might be possible.
Weighing these trade-offs, I would suggest using two options, for the minimum and the maximum.