Hi fellow Flink users,
I'd like to seek advice on how to find the performance bottleneck of a Stateful Functions pipeline. The throughput is too low: ideally we could push it to 2000 messages/s, but I don't get it above 100/s, and the pipeline quickly comes under backpressure.

Some facts:
* The pipeline is running on a powerful Kubernetes cluster, with the RocksDB state backend writing to a Hadoop volume.
* There are six functions; only one of them makes use of state.
* Ingress and egress are via Kafka.
* The pipeline is set to "exactly once" semantics with checkpoints every 10 seconds.

Here is a picture from the Flink UI, showing that the active ingress is backpressured. The functions task has subtasks which take turns being 100% busy:

What I tried:
* Scaled up all function deployments heavily, although each container is under low load.
* Increased the memory of the task managers to 16 GB each.
* Increased the parallelism from 3 to 7 task managers.
* Tried switching on buffer debloating (https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/) - the relevant flink-conf.yaml settings are sketched at the end of this mail.
* Set execution.checkpointing.aligned-checkpoint-timeout: 300sec, because I saw …
* Increased maxNumBatchRequests for all functions (module.yaml sketch also at the end of this mail).

I hope this is all; I tried so many things. How can I figure out why the pipeline is slow, i.e. what the bottleneck is?

Thanks for any advice.

Best,
Christian

--
Dr. Christian Krudewig
Corporate Development - Data Analytics
Deutsche Post DHL
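
P.S. For reference, here is roughly the checkpointing-related part of my flink-conf.yaml. The hdfs path is just a placeholder, and I am assuming unaligned checkpoints are switched on, since the aligned-checkpoint-timeout only takes effect together with them:

    # RocksDB state backend, checkpointing to the Hadoop volume
    state.backend: rocksdb
    state.checkpoints.dir: hdfs://...    # placeholder for our actual volume path

    # exactly-once semantics, checkpoint every 10 seconds
    execution.checkpointing.mode: EXACTLY_ONCE
    execution.checkpointing.interval: 10 s

    # attempts against the backpressure:
    # buffer debloating (available since Flink 1.14)
    taskmanager.network.memory.buffer-debloat.enabled: true
    # start checkpoints aligned, fall back to unaligned after 300 s
    # (the timeout only has an effect when unaligned checkpoints are enabled)
    execution.checkpointing.unaligned: true
    execution.checkpointing.aligned-checkpoint-timeout: 300 s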
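
And this is a sketch of how I raised maxNumBatchRequests for the remote functions in module.yaml - the exact layout depends on the StateFun version (this is the 3.x style, if I remember it correctly), and the namespace and URL are placeholders:

    kind: io.statefun.endpoints.v2/http
    spec:
      # placeholder namespace covering all six functions
      functions: example/*
      # placeholder URL of the remote function deployment
      urlPathTemplate: http://functions-service:8000/statefun
      # raised well above the default to allow larger batches per address
      maxNumBatchRequests: 10000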