Hi fellow Flink users,

 

I'd like to seek advice on how to find the performance bottleneck of a
Stateful Functions pipeline. The throughput is too low: ideally we would
push it to 2000 messages/s, but I can't get it above 100 messages/s. The
pipeline quickly comes under backpressure.

 

Some facts: 

*       The pipeline is running on a powerful Kubernetes cluster, with a
RocksDB state backend writing to a Hadoop volume.
*       There are six functions; only one of them makes use of state.
*       Ingress and egress are via Kafka.
*       The pipeline is set to exactly-once semantics with checkpoints
every 10 seconds (a rough config sketch follows below this list).
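
In flink-conf.yaml terms, the setup corresponds roughly to the sketch below,
assuming the Hadoop volume is used as the checkpoint directory; the path and
exact values are simplified placeholders, not the literal config:

    # flink-conf.yaml (simplified): state backend and checkpointing
    state.backend: rocksdb
    state.checkpoints.dir: hdfs:///flink/checkpoints   # placeholder path
    execution.checkpointing.mode: EXACTLY_ONCE
    execution.checkpointing.interval: 10s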

 

Here is a picture from the Flink UI, showing that the active ingress is
backpressured. The functions task has subtasks which take turns being
100% busy:



 

What I tried:

*       Scaled up all function deployments heavily, although each container
is under low load
*       Increased the memory of the task managers to 16 GB each
*       Increased the parallelism from 3 to 7 task managers
*       Switched on "buffer debloating"
(https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/)
*       Set execution.checkpointing.aligned-checkpoint-timeout: 300sec,
because I saw
*       Increased "maxNumBatchRequests" for all functions (see the sketch
after this list)

 

I hope this is all; I have tried so many things.

 

How can I figure out why the pipeline is slow, i.e. what the bottleneck is?

 

Thanks for any advice.

 

Best,

 

Christian

 

--

Dr. Christian Krudewig
Corporate Development - Data Analytics

Deutsche Post DHL



 
