I think your idea of adding monitoring similar to the network memory
warnings makes sense. Here is a rough outline of what I have in mind,
without having thought it through in depth:

1. Metrics for RPC message sizes - Track the serialized size of outgoing
RPC messages and expose metrics showing:
    - Current message size
    - Percentage of max frame size used
    - Peak message size
2. Warning logs - Similar to network memory warnings, log something like:
"RPC message size is X% of maximum frame size (Y/Z bytes). Consider
increasing pekko.framesize if you see timeouts."
3. Proactive detection - Before sending large messages, check if they
approach the frame size limit (e.g., >80%) and log warnings.

The monitoring points would be in PekkoInvocationHandler.java [1] (RPC
invocation handling) and PekkoRpcActor.java [2] (RPC message receiving).
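
To make the three points above a bit more concrete, here is a minimal
sketch in plain Java. FrameSizeMonitor and its methods are hypothetical
names of mine, not existing Flink or Pekko APIs. It assumes the serialized
message size and the configured maximum frame size are already known at the
call site, and it returns the warning text instead of logging it so that it
stays self-contained:

```java
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Flink): tracks RPC message sizes against the frame size. */
final class FrameSizeMonitor {
    private final long maxFrameSizeBytes;
    private final double warnFraction; // e.g. 0.8 to warn above 80% of the frame size
    private final AtomicLong lastSizeBytes = new AtomicLong();
    private final AtomicLong peakSizeBytes = new AtomicLong();

    FrameSizeMonitor(long maxFrameSizeBytes, double warnFraction) {
        if (maxFrameSizeBytes <= 0) {
            throw new IllegalArgumentException("maxFrameSizeBytes must be positive");
        }
        this.maxFrameSizeBytes = maxFrameSizeBytes;
        this.warnFraction = warnFraction;
    }

    /** Records one outgoing message; returns a warning text if it exceeds the threshold. */
    Optional<String> record(long serializedSizeBytes) {
        lastSizeBytes.set(serializedSizeBytes);
        peakSizeBytes.accumulateAndGet(serializedSizeBytes, Math::max);
        double fraction = (double) serializedSizeBytes / maxFrameSizeBytes;
        if (fraction > warnFraction) {
            return Optional.of(String.format(
                    Locale.ROOT,
                    "RPC message size is %.1f%% of maximum frame size (%d/%d bytes). "
                            + "Consider increasing pekko.framesize if you see timeouts.",
                    fraction * 100.0, serializedSizeBytes, maxFrameSizeBytes));
        }
        return Optional.empty();
    }

    /** Size of the most recently recorded message (the "current message size" gauge). */
    long lastSizeBytes() {
        return lastSizeBytes.get();
    }

    /** Largest message size recorded so far (the "peak message size" gauge). */
    long peakSizeBytes() {
        return peakSizeBytes.get();
    }

    /** Percentage of the maximum frame size that a message of the given size would use. */
    double usagePercent(long sizeBytes) {
        return sizeBytes * 100.0 / maxFrameSizeBytes;
    }
}
```

In a real change the three getters would presumably be registered as gauges
and the returned text passed to the logger, but how that wiring would look
in the two classes mentioned above is something I have not checked.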

BR,
G

[1]
https://github.com/apache/flink/blob/master/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoInvocationHandler.java
[2]
https://github.com/apache/flink/blob/master/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoRpcActor.java

On Wed, Jan 21, 2026 at 1:12 PM Alexis Sarda-Espinosa <
[email protected]> wrote:

> Hi Gabor,
>
> I currently can't offer much advice there because I don't know which
> components from the Flink framework depend on pekko's frame size.
>
> I know there are warnings logged for network memory and network buffers,
> something like "110% of network memory requested, max value X, consider
> increasing it." Maybe it's possible to have something like that for RPC
> frame size? Ideally exposed with a metric that shows when requests exceed a
> certain percentage of the max value.
>
> Regards,
> Alexis.
>
> On Wed, 21 Jan 2026, 12:42 Gabor Somogyi, <[email protected]>
> wrote:
>
>> Hi Alexis,
>>
>> I'm not aware of such a feature. Just for my own understanding, how
>> would you imagine such a feature working?
>>
>> BR,
>> G
>>
>>
>> On Wed, Jan 21, 2026 at 11:20 AM Alexis Sarda-Espinosa <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> If I recall correctly, pekko's frame size (and also akka's in the past)
>>> was always an issue. I think the documentation said that sometimes the
>>> application just needs a larger size and it's not possible to know in
>>> advance when that can happen. Today we saw a job restart and subsequently
>>> crashloop with this exception cause:
>>>
>>> Caused by: java.util.concurrent.TimeoutException: Invocation of
>>> [RemoteRpcInvocation(TaskExecutorGateway.submitTask(TaskDeploymentDescriptor,
>>> JobMasterId, Duration))] at recipient [pekko.tcp://
>>> [email protected]:6122/user/rpc/taskmanager_0] timed out. This is
>>> usually caused by: 1) Pekko failed sending the message silently, due to
>>> problems like oversized payload or serialization failures. In that case,
>>> you should find detailed error information in the logs. 2) The recipient
>>> needs more time for responding, due to problems like slow machines or
>>> network jitters. In that case, you can try to increase pekko.ask.timeout.
>>>
>>> To fix this, I increased both pekko.ask.timeout & pekko.framesize
>>> simultaneously, so I'm not sure which one was the root cause, but in any
>>> case, is there still no way to monitor if this limit could be reached
>>> before it happens?
>>>
>>> This was with Flink 2.1.1
>>>
>>> Regards,
>>> Alexis.
>>>
>>
