Hi Filip,

two things will impact sync time for Kafka:

1. Flushing all old data [1], in particular flushing all in-flight partitions [2]. However, that shouldn't cause a stacking effect unless the brokers are overloaded at checkpoint time.

2. Opening a new transaction [3]. Since all transactions are linearized on the Kafka brokers, this is the most likely root cause. Note that aborted checkpoints may require multiple transactions to be opened, so it's worth checking whether your checkpoints are aborted frequently.
If you want to know more, I suggest you attach a profiler, find the specific culprit, and report back [4]. There is a low probability that the sink framework has a bug causing this behavior; in that case, we can fix it more easily than if it's a fundamental issue with Kafka.

In general, exactly-once and low latency are somewhat conflicting requirements, so there is only so much you can do. Not knowing your topology, maybe you can reduce the number of sinks? With the KafkaRecordSerializationSchema you can set different topics for different ProducerRecords of the same DataStream.

[1] https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L190-L190
[2] https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/FlinkKafkaInternalProducer.java#L177-L183
[3] https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L302-L321
[4] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/debugging/application_profiling/

On Sat, Mar 26, 2022 at 2:11 PM Filip Karnicki <filip.karni...@gmail.com> wrote:

> Hi, I noticed that with each added (kafka) sink with exactly-once
> guarantees, there looks to be a penalty of ~100ms in terms of sync
> checkpointing time.
>
> Would anyone be able to explain and/or point me in the right direction in
> the source code so that I could understand why that is? Specifically, why
> there appears to be a 100ms added for _each_ sink, and not a flat 100ms for
> all sinks, potentially pointing to a sequential set of IO calls (wiiiild
> guess)
>
> I would be keen to understand if there's anything I could do (incl.
> contributing code) that would parallelise this penalty in terms of sync
> checkpointing time.
>
> Alternatively, is there any setting that would help me bring the sync
> checkpointing time down (and still get exactly-once guarantees)?
>
> Many thanks,
> Fil
>
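P.S. To illustrate the "one sink, many topics" idea: a minimal sketch of a KafkaRecordSerializationSchema that picks the target topic per record. The event type `MyEvent` and its `isAlert()`/`key()`/`payload()` accessors are made up for illustration; the topic names are placeholders too.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical routing schema: one exactly-once KafkaSink writing to
// several topics, instead of one sink (and one transaction) per topic.
public class RoutingSerializationSchema
        implements KafkaRecordSerializationSchema<MyEvent> {

    @Override
    public ProducerRecord<byte[], byte[]> serialize(
            MyEvent element, KafkaSinkContext context, Long timestamp) {
        // Choose the topic from the record itself (placeholder logic).
        String topic = element.isAlert() ? "alerts" : "events";
        return new ProducerRecord<>(
                topic,
                element.key().getBytes(StandardCharsets.UTF_8),
                element.payload().getBytes(StandardCharsets.UTF_8));
    }
}
```

You'd then register it once via KafkaSink.builder().setRecordSerializer(new RoutingSerializationSchema()), so all topics share a single sink's transaction instead of paying the per-sink transaction cost at every checkpoint.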