Hi Alexis,

We have recently added support for canary deployments, which allows the liveness probe to detect general operator problems.
https://issues.apache.org/jira/browse/FLINK-31219

It's not completely automatic and you have to deploy the canaries yourself, but I think it will be helpful :)

This will be part of the upcoming 1.5.0 release.

Cheers,
Gyula

On Fri, Apr 21, 2023 at 11:50 PM Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:

> Hello,
>
> Today, we received an alert because the operator appeared to be down. Upon
> further investigation, we realized the alert was triggered because the
> endpoint for Prometheus metrics (which we enabled) stopped responding, so
> it seems the endpoint used for the liveness probe wasn't affected and the
> pod was not restarted automatically.
>
> The logs right before the problem started don't show anything odd, and
> once the problem started, the logs were spammed with warning messages
> stating "Connection reset by peer" with no further information. From what I
> can see, nothing else was logged during that time, so it looks like the
> process really had stalled.
>
> I imagine this is not easy to reproduce and, while a pod restart was
> enough to get back on track, it might be worth improving the liveness probe
> to catch these situations.
>
> Full stacktrace for reference:
>
> An exceptionCaught() event was fired, and it reached at the tail of the
> pipeline. It usually means the last handler in the pipeline did not handle
> the exception.
> java.io.IOException: Connection reset by peer
>     at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
>     at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
>     at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
>     at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
>     at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledDirectByteBuf.setBytes(UnpooledDirectByteBuf.java:570)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Unknown Source)
>
> Regards,
> Alexis.