Hi,
At first glance I cannot find anything wrong with those settings. If
a memory configuration problem had caused this error, I guess it
would be visible as an exception somewhere. It's unlikely to be a GC issue, as if
some machine froze and stopped responding for a longer period of tim
Hi,
This exception looks like it was thrown by a downstream Task/TaskManager
when trying to read a message/packet from some upstream Task/TaskManager
and that the connection between the two TaskManagers was reset (closed abruptly).
So it's the case:
> involves communicating with other non-collocated tas
Hello, Piotr.
Thank you.
This is an error logged to the taskmanager just before it became "lost" to
the jobmanager (i.e., reported as "lost" in the jobmanager log just before
the job restart). In what context would this particular error (not the
root-root cause you referred to) be thrown from a t
Hi Kye,
This error is almost certainly not the primary cause of the failure. This
error means that the node reporting it has detected some fatal failure on
the other side of the wire (connection reset by peer), but the original
error is somehow too slow or unable to propagate to the JobManager bef
I forgot to mention: this is Flink 1.10.
-K
On Mon, Dec 7, 2020 at 5:08 PM Kye Bae wrote:
> Hello!
>
> We have a real-time streaming workflow that has been running for about 2.5
> weeks.
>
> Then, we began to get the exception below from taskmanagers (random) since
> yesterday, and the job bega