Hi,

Flink maintains its own Kafka offsets in its checkpoints and does not rely on Kafka's offset management. That way, Flink guarantees that read offsets and checkpointed operator state are always aligned. The BucketingSink is designed not to lose any data: on a checkpoint it records the valid length of each in-progress file, and on recovery files are truncated to that length (or a .valid-length file is written), so records written after the last completed checkpoint are replayed from Kafka rather than lost. The mode of operation is described in detail in the JavaDocs of the class [1].
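To make the interplay concrete, here is a minimal job sketch (not code from this thread; the topic name, HDFS path, broker address, and interval values are placeholders I picked for illustration):

// Minimal sketch for Flink 1.4 -- topic, path, and intervals are placeholders.
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaToHdfsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka offsets are snapshotted into Flink's checkpoints; without
        // checkpointing enabled there is no exactly-once guarantee for the sink.
        env.enableCheckpointing(60_000); // 60s; a larger interval also means larger part files

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "my-group"); // not used for restoring offsets

        DataStream<String> stream = env.addSource(
            new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props));

        // On a checkpoint the sink records the valid length of in-progress files;
        // on recovery it truncates them (or writes a .valid-length marker) and
        // Kafka replays everything after the restored offsets.
        BucketingSink<String> sink = new BucketingSink<>("hdfs:///tmp/out"); // placeholder path
        sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH"));
        sink.setBatchSize(128L * 1024 * 1024); // roll part files at ~128 MB

        stream.addSink(sink);
        env.execute("kafka-to-hdfs");
    }
}

With a setup like this, a crash rolls the job back to the last completed checkpoint: the truncated in-progress files and the restored offsets line up, so the scenario you describe below does not lose data.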
Please check the JavaDocs and let me know if you have questions or doubts about the mechanism.

Best, Fabian

[1] https://github.com/apache/flink/blob/release-1.4/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java

2018-03-21 7:38 GMT+01:00 XilangYan <xilang....@gmail.com>:
> Thank you! Fabian
>
> The HDFS small-file problem can be avoided with a large checkpoint interval.
>
> Meanwhile, there is a potential data loss problem in the current BucketingSink.
> Say we consume data from Kafka: when a checkpoint is requested, the Kafka
> offset is updated, but the in-progress file in the BucketingSink remains.
> If Flink crashes after that, the data in the in-progress file is lost.
> Am I right?