Hi Giselle,

Could you share the logs of this run with us? They might contain some
hints about what is going wrong. Could you also tell us a bit more about
the Flink job and which Flink version you are using?

Have you tried using a different netty transport type via
`taskmanager.network.netty.transport`? You could set it to `nio`, for
example.
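
As a rough sketch, and assuming your TaskManagers read their settings from
the usual `flink-conf.yaml` (I have not seen your setup, so adjust as
needed), the change would look something like this:

```yaml
# flink-conf.yaml (sketch, assuming a standard file-based configuration)
# Use the Java NIO transport instead of the native epoll one.
taskmanager.network.netty.transport: nio
```

The TaskManagers need to be restarted for this to take effect. If the time
spent in epollwait simply moves to the NIO selector after the switch, the
transport itself is probably not the cause.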

I am also pulling in Piotr, who might know more about problems in the
network stack.

Cheers,
Till

On Sat, Nov 7, 2020 at 9:11 AM Giselle van Dongen <
giselle.vandon...@ugent.be> wrote:

> Dear community,
>
> We have a Flink job which does some parsing, a join and a window.
> When we increase the load, CPU increases gradually with the throughput.
> But around 65% CPU, there is suddenly a jump to 98%.
> The job starts experiencing backpressure and becomes unstable (increasing
> latency, memory doesn't get cleaned up well anymore).
> When profiling CPU, we notice that most CPU time is going to epollwait
> from netty (40-60%). We see this before and after the job becomes unstable.
> Does this mean it has something to do with network saturation?
> We also see checkpointing taking around a second at this point (160MB).
>
> What are some avenues we can explore to improve this?
>
> Thank you for any help provided!
>
> Giselle
>
>
