Hello, Today, we received an alert because the operator appeared to be down. Upon further investigation, we realized the alert was triggered because the endpoint for Prometheus metrics (which we enabled) stopped responding, so it seems the endpoint used for the liveness probe wasn't affected and the pod was not restarted automatically.
The logs right before the problem started don't show anything odd, and once the problem started, the logs were spammed with warning messages stating "Connection reset by peer" with no further information. From what I can see, nothing else was logged during that time, so it looks like the process really had stalled. I imagine this is not easy to reproduce and, while a pod restart was enough to get back on track, it might be worth improving the liveness probe to catch these situations. Full stacktrace for reference: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. java.io.IOException: Connection reset by peer at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method) at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source) at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) at java.base/sun.nio.ch.IOUtil.read(Unknown Source) at java.base/sun.nio.ch.IOUtil.read(Unknown Source) at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source) at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledDirectByteBuf.setBytes(UnpooledDirectByteBuf.java:570) at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Unknown Source) Regards, Alexis.