Hello,

If I recall correctly, Pekko's frame size (and Akka's before it) has always
been an issue. I think the documentation said that an application sometimes
just needs a larger frame size, and that it's not possible to know in advance
when that will happen. Today we saw a job restart and subsequently crash-loop
with this exception cause:

Caused by: java.util.concurrent.TimeoutException: Invocation of
[RemoteRpcInvocation(TaskExecutorGateway.submitTask(TaskDeploymentDescriptor,
JobMasterId, Duration))] at recipient [pekko.tcp://
[email protected]:6122/user/rpc/taskmanager_0] timed out. This is
usually caused by: 1) Pekko failed sending the message silently, due to
problems like oversized payload or serialization failures. In that case,
you should find detailed error information in the logs. 2) The recipient
needs more time for responding, due to problems like slow machines or
network jitters. In that case, you can try to increase pekko.ask.timeout.

To fix this, I increased both pekko.ask.timeout and pekko.framesize at the
same time, so I'm not sure which one was the root cause. In any case, is
there still no way to monitor whether the frame size limit is about to be
reached before it happens?
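
For reference, the two options mentioned above are set in the Flink
configuration roughly like this. The values below are only illustrative
(twice what I believe the defaults to be, 10485760b and 10 s), not
necessarily the ones used for this job:

```yaml
# Illustrative values only; adjust to the job's actual needs.
pekko.ask.timeout: 20 s      # timeout for RPC calls such as submitTask
pekko.framesize: 20971520b   # maximum size of a single RPC message (20 MiB)
```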

This was with Flink 2.1.1.

Regards,
Alexis.
