Hello,

Today we received an alert indicating that the operator was down. On
further investigation, we found that the alert fired because the Prometheus
metrics endpoint (which we had enabled) stopped responding. The endpoint
used for the liveness probe was apparently unaffected, so the pod was not
restarted automatically.

The logs right before the problem started show nothing unusual. Once it
started, the logs were flooded with warnings stating "Connection reset by
peer" and no further information. As far as I can see, nothing else was
logged during that time, so it looks like the process really had stalled.

I imagine this is not easy to reproduce. A pod restart was enough to
recover, but it might be worth improving the liveness probe so that it
catches situations like this.
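
As a rough illustration of the kind of check I have in mind, here is a
minimal sketch in plain Java (11+). It is not operator code: the class name
MetricsEndpointCheck, the method isResponsive, and the idea of pointing the
probe at the metrics endpoint are all assumptions made for this example. It
treats the endpoint as healthy only if it answers with a 2xx status within
a timeout, so a stalled endpoint would be reported as unhealthy.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of the operator. */
public final class MetricsEndpointCheck {

    private MetricsEndpointCheck() {}

    /**
     * Returns true only if the endpoint answers with an HTTP 2xx status
     * within the given timeout. Timeouts, refused or reset connections,
     * and non-2xx responses all count as "not responsive".
     */
    public static boolean isResponsive(URI endpoint, Duration timeout) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(timeout)
                .GET()
                .build();
        try {
            // The body is discarded; only the status code matters here.
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            int status = response.statusCode();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            // Covers timeouts and connection failures.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A liveness check could call something like this against the local metrics
port and report failure when it returns false. Whether that belongs in the
probe itself or in a separate health endpoint is a design question I will
leave to the maintainers.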

Full stacktrace for reference:

An exceptionCaught() event was fired, and it reached at the tail of the
pipeline. It usually means the last handler in the pipeline did not handle
the exception.
java.io.IOException: Connection reset by peer
    at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
    at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledDirectByteBuf.setBytes(UnpooledDirectByteBuf.java:570)
    at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)

Regards,
Alexis.
