Hi Alexis,

We have recently added support for canary deployments, which allow the
liveness probe to detect general operator problems.

https://issues.apache.org/jira/browse/FLINK-31219

It's not completely automatic, and you have to deploy the canaries yourself,
but I think it will be helpful :)
This will be part of the upcoming 1.5.0 release.
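
To give a rough idea of the concept, here is a minimal sketch (not the
operator's actual code; the class and method names are made up for
illustration). It assumes the reconcile loop records a timestamp each time it
processes a canary, and the liveness check fails once any canary has gone
unprocessed for longer than a timeout:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of a canary-based liveness check.
public class CanaryHealthCheck {
    private final Duration timeout;
    // Last time each canary was seen by the reconcile loop.
    private final Map<String, Instant> lastReconciled = new ConcurrentHashMap<>();

    public CanaryHealthCheck(Duration timeout) {
        this.timeout = timeout;
    }

    // Called by the (assumed) reconcile loop whenever a canary is processed.
    public void recordReconcile(String canaryName, Instant now) {
        lastReconciled.put(canaryName, now);
    }

    // Called by the liveness endpoint. With no canaries registered there is
    // nothing to judge, so the check reports healthy.
    public boolean isHealthy(Instant now) {
        for (Instant last : lastReconciled.values()) {
            if (Duration.between(last, now).compareTo(timeout) > 0) {
                return false; // a canary was not reconciled in time
            }
        }
        return true;
    }
}
```

If the reconcile loop stalls, the timestamps stop advancing, the check starts
failing, and Kubernetes restarts the pod.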

Cheers,
Gyula

On Fri, Apr 21, 2023 at 11:50 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hello,
>
> Today we received an alert because the operator appeared to be down. On
> further investigation, we found that the alert fired because the
> Prometheus metrics endpoint (which we had enabled) stopped responding. The
> endpoint used by the liveness probe was apparently unaffected, so the
> pod was not restarted automatically.
>
> The logs right before the problem started don't show anything odd, and
> once the problem started, the logs were spammed with warning messages
> stating "Connection reset by peer" with no further information. From what I
> can see, nothing else was logged during that time, so it looks like the
> process really had stalled.
>
> I imagine this is not easy to reproduce and, while a pod restart was
> enough to get back on track, it might be worth improving the liveness probe
> to catch these situations.
>
> Full stacktrace for reference:
>
> An exceptionCaught() event was fired, and it reached at the tail of the
> pipeline. It usually means the last handler in the pipeline did not handle
> the exception.
> java.io.IOException: Connection reset by peer
>     at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
>     at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
>     at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
>     at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
>     at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledDirectByteBuf.setBytes(UnpooledDirectByteBuf.java:570)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Unknown Source)
>
> Regards,
> Alexis.
>
>
