Hi Ashwin,

Thanks for reaching out to the Flink community. Since you have verified that kafka_source -> discarding_sink can process 10 million records/s, you might also want to test the write throughput to data_sink and dlq_sink. These sinks may be limiting your overall throughput by backpressuring the data flow. If that is not the problem, then I believe some profiling could help pinpoint the bottleneck.

Cheers,
Till

On Sun, Nov 8, 2020 at 10:26 PM ashwin konale <ashwin.kon...@gmail.com> wrote:

> Hey guys,
> I am struggling to improve the throughput of my simple flink application.
> The target topology is this.
>
> read_from_kafka(byte array deserializer) --rescale-->
> processFunction(confluent avro deserialization) -> split -> 1.
> data_sink,2.dlq_sink
>
> Kafka traffic is pretty high
> Partitions: 128
> Traffic: ~500k msg/s, 50Mbps.
>
> Flink is running on k8s writing to hdfs3. I have ~200CPU and 400G memory
> at hand. I have tried few configurations but I am not able to get the
> throughput more than 1mil per second. (Which I need for recovering from
> failures). I have tried increasing parallelism a lot (until 512), But it
> has very little impact on the throughput. Primary metric I am considering
> for throughput is kafka-source, numRecordsOut and message backlog. I have
> already increased default kafka consumer defaults like max.poll.records
> etc. Here are the few things I tried already.
> Try0: Check raw kafka consumer throughput (kafka_source -> discarding_sink)
> tm: 20, slots:4, parallelism 80
> throughput: 10Mil/s
>
> Try1: Disable chaining to introduce network related lag.
> tm: 20, slots:4, parallelism 80
> throughput: 1Mil/s
> Also tried with increasing floating-buffers to 100, and
> buffers-per-channel to 64. Increasing parallelism seems to have no effect.
> Observation: out/in buffers are always at 100% utilization.
>
> After this I have tried various different things with different network
> configs, parallelism,jvm sizes etc. But throughput seems to be stuck at
> 1Mil. Can someone please help me to figure out what key metrics to look for
> and how can I improve the situation. Happy to provide any details needed.
>
> Flink version: 1.11.2
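
As a rough back-of-the-envelope sketch of the numbers quoted above, the following plain Java snippet (no Flink dependency) estimates how long recovery takes at a given processing rate, and how many source subtasks can actually be busy for a given partition count. The class and method names are illustrative only, and the 10-minute outage is an assumed figure, not something from the thread. The second helper assumes the Kafka source itself runs at the stated parallelism; since each partition is read by exactly one source subtask, subtasks beyond the partition count receive no data.

```java
public final class CatchUpEstimate {

    /** Seconds needed to drain a backlog while new records keep arriving. */
    public static double catchUpSeconds(long backlogRecords, double processRate, double ingestRate) {
        // Only the rate above the live ingest rate goes towards draining the backlog.
        double headroom = processRate - ingestRate;
        if (headroom <= 0) {
            return Double.POSITIVE_INFINITY; // the job never catches up
        }
        return backlogRecords / headroom;
    }

    /** Number of Kafka source subtasks that can be assigned at least one partition. */
    public static int activeSourceSubtasks(int partitions, int sourceParallelism) {
        return Math.min(partitions, sourceParallelism);
    }

    public static void main(String[] args) {
        double ingestRate = 500_000;     // msg/s, from the thread
        double processRate = 1_000_000;  // msg/s, the observed ceiling
        long downtimeSeconds = 600;      // ASSUMPTION: a hypothetical 10-minute outage
        long backlog = (long) (ingestRate * downtimeSeconds);

        System.out.println("catch-up seconds: "
                + catchUpSeconds(backlog, processRate, ingestRate));
        System.out.println("active source subtasks: "
                + activeSourceSubtasks(128, 512));
    }
}
```

With these inputs, a job processing 1M msg/s against 500k msg/s of live traffic needs as long to catch up as it was down, and raising the source parallelism past 128 cannot add source-side capacity. Extra parallelism would only help the operators after the rescale.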