I've responded on the dev ML. Let's continue the discussion there: https://lists.apache.org/thread.html/r251e395c759193d9c75f97b8bfc4917772219ea48bb1848ccc23d26e%40%3Cdev.flink.apache.org%3E
Cheers,
Till

On Thu, Jul 30, 2020 at 8:57 PM Paul Bernier <[email protected]> wrote:

> Hi experts,
>
> I am trying to use the S3 StreamingFileSink with a high number of active
> buckets (>1000). I found that checkpointing duration grows linearly
> with the number of active buckets, which makes achieving a high number of
> active buckets difficult. One reason is that the active buckets
> are snapshotted sequentially in a loop
> <https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L245>.
> Given that this operation involves waiting for some data to finish being
> uploaded to S3, that can become quite a long wait.
>
> My question is: could this loop be safely multi-threaded?
>
> Each Bucket seems independent (they do share the bucketWriter though). I
> have also done some basic prototyping and validation and it looks OK. So I
> am wondering whether I am overlooking anything and whether my approach is
> viable.
>
> Note: the same approach would also need to be applied to the
> onSuccessfulCompletionOfCheckpoint step, where this while loop commits
> files to S3
> <https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L208>.
>
> Thank you.
>
> Paul
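
For readers following the thread, the idea in the quoted question (replacing a sequential per-bucket snapshot loop with concurrent calls) can be sketched roughly as below. This is only an illustration under stated assumptions, not Flink's code: `Bucket`, `snapshot()` and `snapshotAll` are hypothetical stand-ins, the snapshot state is reduced to a `String`, and the sketch assumes each bucket's snapshot is independent and thread-safe. Whether that holds for the shared bucketWriter is exactly the open question raised above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelBucketSnapshot {

    // Hypothetical stand-in for a bucket; NOT Flink's Bucket class.
    interface Bucket {
        String snapshot() throws Exception; // may block, e.g. on an upload
    }

    // Submits one snapshot task per bucket and waits for all of them.
    // Results are returned in the same order as the input list.
    static List<String> snapshotAll(List<Bucket> buckets, ExecutorService pool)
            throws Exception {
        List<Callable<String>> tasks = new ArrayList<>();
        for (Bucket bucket : buckets) {
            tasks.add(bucket::snapshot);
        }
        List<String> states = new ArrayList<>();
        // invokeAll blocks until every task has completed or failed.
        for (Future<String> future : pool.invokeAll(tasks)) {
            // get() rethrows a task failure as ExecutionException, so one
            // failed bucket fails the whole snapshot.
            states.add(future.get());
        }
        return states;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Bucket> buckets = new ArrayList<>();
            for (int i = 0; i < 8; i++) {
                final int id = i;
                buckets.add(() -> "state-of-bucket-" + id);
            }
            System.out.println(snapshotAll(buckets, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

With a bounded pool, the total wait is governed by the slowest uploads and the pool size, not by the sum of all per-bucket waits. The sketch deliberately fails the whole snapshot if any single bucket fails, since a partial snapshot would not be a consistent checkpoint.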
