Hi Filip,

Two things impact the sync checkpoint time for the Kafka sink:
1. Flushing all old data [1], in particular flushing all in-flight
partitions [2]. However, that shouldn't cause a stacking effect unless the
brokers are overloaded at checkpoint time.
2. Opening a new transaction [3]. Since all transactions are linearized on
the Kafka brokers, this is the most likely root cause: each sink opens its
own transaction per checkpoint, so the cost grows with the number of sinks
(see the sketch after this list). Note that aborted checkpoints may require
multiple transactions to be opened, so it's worth checking whether your
checkpoints are aborted frequently.
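
For context, a typical exactly-once KafkaSink in 1.14 is wired up roughly
like this (a minimal sketch based on the documented builder API; broker
address, topic, and id prefix are placeholders). Every sink configured this
way runs its own transaction lifecycle per checkpoint, which matches the
per-sink penalty you are seeing:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink =
    KafkaSink.<String>builder()
        .setBootstrapServers("broker:9092")            // placeholder
        .setRecordSerializer(
            KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")              // placeholder
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        // renamed to setDeliveryGuarantee in later releases
        .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setTransactionalIdPrefix("sink-1")  // must be unique per sink
        .build();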

If you want to dig deeper, I suggest you attach a profiler, find the
specific culprit, and report back [4]. There is a low probability that the
sink framework has a bug causing this behavior; in that case, we can fix it
more easily than if it's a fundamental issue with Kafka. In general,
exactly-once and low latency are somewhat contradictory requirements, so
there is only so much you can do.
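
For example, the page in [4] shows how to attach Java Flight Recorder by
adding JVM options to flink-conf.yaml (flags along the lines of that page;
adjust the dump path to your setup):

env.java.opts: "-XX:+UnlockCommercialFeatures -XX:+UnlockDiagnosticVMOptions -XX:+FlightRecorder -XX:+DebugNonSafepoints -XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/dump.jfr"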

I don't know your topology, but maybe you can reduce the number of sinks?
With KafkaRecordSerializationSchema you can set a different topic for each
ProducerRecord of the same DataStream, so a single sink can write to
several topics.
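
Here is a minimal sketch of such a routing schema. Event, getTopic(), and
getPayload() are placeholders for whatever your records look like:

import java.nio.charset.StandardCharsets;

import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicRoutingSchema implements KafkaRecordSerializationSchema<Event> {

    @Override
    public ProducerRecord<byte[], byte[]> serialize(
            Event element, KafkaSinkContext context, Long timestamp) {
        // Choose the target topic per record, so a single exactly-once sink
        // (one transaction per checkpoint) can serve all topics.
        return new ProducerRecord<>(
                element.getTopic(),
                element.getPayload().getBytes(StandardCharsets.UTF_8));
    }
}

Passing that to setRecordSerializer(new TopicRoutingSchema()) would replace
N sinks, and with them N sequentially opened transactions per checkpoint,
with a single one.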

[1]
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L190-L190
[2]
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/FlinkKafkaInternalProducer.java#L177-L183
[3]
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L302-L321
[4]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/debugging/application_profiling/


On Sat, Mar 26, 2022 at 2:11 PM Filip Karnicki <filip.karni...@gmail.com>
wrote:

> Hi, I noticed that with each added (kafka) sink with exactly-once
> guarantees, there looks to be a penalty of ~100ms in terms of sync
> checkpointing time.
>
> Would anyone be able to explain and/or point me in the right direction in
> the source code so that I could understand why that is? Specifically, why
> there appears to be a 100ms added for _each_ sink, and not a flat 100ms for
> all sinks, potentially pointing to a sequential set of IO calls (wiiiild
> guess)
>
> I would be keen to understand if there's anything I could do (incl.
> contributing code) that would parallelise this penalty in terms of sync
> checkpointing time.
>
> Alternatively, is there any setting that would help me bring the sync
> checkpointing time down (and still get exactly-once guarantees)?
>
> Many thanks,
> Fil
>
