Hello,

If I recall correctly, Pekko's frame size (and Akka's before it) has always been an issue. I think the documentation said that an application sometimes just needs a larger size, and that it's not possible to know in advance when that will happen.

Today we saw a job restart and then crash-loop with this exception as the cause:
Caused by: java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(TaskExecutorGateway.submitTask(TaskDeploymentDescriptor, JobMasterId, Duration))] at recipient [pekko.tcp:// [email protected]:6122/user/rpc/taskmanager_0] timed out. This is usually caused by: 1) Pekko failed sending the message silently, due to problems like oversized payload or serialization failures. In that case, you should find detailed error information in the logs. 2) The recipient needs more time for responding, due to problems like slow machines or network jitters. In that case, you can try to increase pekko.ask.timeout.

To fix this, I increased both pekko.ask.timeout and pekko.framesize at the same time, so I'm not sure which one was the root cause. In any case, is there still no way to monitor whether we are approaching this limit before it is hit?

This was with Flink 2.1.1.

Regards,
Alexis.
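
P.S. For context, the two options mentioned above are ordinary entries in the Flink configuration file. A minimal sketch, with placeholder values that are assumptions for illustration and not the ones from our deployment:

```yaml
# Illustrative values only (assumptions), not the settings actually used.
pekko.ask.timeout: 60 s       # timeout for RPC asks such as submitTask
pekko.framesize: 20971520b    # maximum RPC message size in bytes (20 MiB here)
```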
