We are using the Flink file connector to continuously scan S3 paths. We use
FileSource, which relies on NonSplittingRecursiveEnumerator to enumerate the
files under those paths. For each parent path, the enumerateSplits function
recursively issues an S3 list call for each S3 “sub-directory”, so the number
of calls to S3 grows with the number of sub-directories. This can be far
slower than calling the S3 ListObjectsV2 API directly on the parent path.


For example, if the parent path is /test, and there are 1000 sub-directories
under /test, this results in roughly 1000 list calls to S3 versus 1 call.
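
To make the difference concrete, here is a minimal, self-contained sketch that
counts simulated list requests for both strategies. It does not use Flink or
the AWS SDK: ListingCallCount, listDirectory, enumerateRecursively and
enumerateFlat are hypothetical names, the bucket is an in-memory set of keys,
and it assumes that each directory listing maps to exactly one S3 list request
and that a flat prefix listing returns up to 1000 keys per request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory model of a bucket; not Flink or AWS SDK code.
class ListingCallCount {
    private final TreeSet<String> keys = new TreeSet<>(); // sorted object keys
    int listCalls = 0;                                    // simulated list requests

    void put(String key) {
        keys.add(key);
    }

    // One delimiter-style listing of a single "directory" (counts as one request).
    // Returns immediate children: file keys, and sub-prefixes ending in "/".
    List<String> listDirectory(String prefix) {
        listCalls++;
        TreeSet<String> children = new TreeSet<>();
        for (String key : keys.tailSet(prefix)) {
            if (!key.startsWith(prefix)) break; // left the prefix range
            int slash = key.indexOf('/', prefix.length());
            children.add(slash < 0 ? key : key.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    // Recursive enumeration: one listing per directory that is visited.
    List<String> enumerateRecursively(String prefix) {
        List<String> files = new ArrayList<>();
        for (String child : listDirectory(prefix)) {
            if (child.endsWith("/")) {
                files.addAll(enumerateRecursively(child));
            } else {
                files.add(child);
            }
        }
        return files;
    }

    // Flat enumeration: prefix listing without a delimiter, one request per page.
    List<String> enumerateFlat(String prefix, int pageSize) {
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(prefix)) {
            if (!key.startsWith(prefix)) break;
            if (files.size() % pageSize == 0) listCalls++; // a new page starts here
            files.add(key);
        }
        if (files.isEmpty()) listCalls++; // an empty listing still costs one request
        return files;
    }

    public static void main(String[] args) {
        ListingCallCount store = new ListingCallCount();
        for (int i = 0; i < 1000; i++) {
            store.put("test/dir" + i + "/part-0"); // 1000 sub-directories, one file each
        }

        store.enumerateRecursively("test/");
        System.out.println("recursive list calls: " + store.listCalls);

        store.listCalls = 0;
        store.enumerateFlat("test/", 1000);
        System.out.println("flat list calls: " + store.listCalls);
    }
}
```

In this model the recursive walk issues one request for the parent plus one per
sub-directory, while the flat listing needs one request per 1000 keys. The real
request count depends on how the configured S3 file system implements directory
listing, so these numbers are illustrative only.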


Could you let us know whether this behavior is expected and, if so, whether
it could be optimized to reduce the number of S3 list calls? This is a
blocker for us.


Thank you.
