Hi Giselle,

could you share the logs of this run with us? They might contain some useful details. Could you also tell us a bit more about the Flink job and which Flink version you are using?
Have you tried using a different netty transport type via `taskmanager.network.netty.transport`? You could set it to `nio`, for example. I am also pulling in Piotr who might know more about problems in the network stack.

Cheers,
Till

On Sat, Nov 7, 2020 at 9:11 AM Giselle van Dongen <giselle.vandon...@ugent.be> wrote:

> Dear community,
>
> We have a Flink job which does some parsing, a join and a window.
> When we increase the load, CPU increases gradually with the throughput.
> But around 65% CPU, there is suddenly a jump to 98%.
> The job starts experiencing backpressure and becomes unstable (increasing
> latency, memory doesn't get cleaned up well anymore).
> When profiling CPU, we notice that most CPU time is going to epollwait
> from netty (40-60%). We see this before and after the job becomes unstable.
> Does this mean it has something to do with network saturation?
> We also see checkpointing taking around a second at this point (160MB).
>
> What are some avenues we can explore to improve this?
>
> Thank you for any help provided!
>
> Giselle