Thanks for the details. I can open a Jira ticket and might even look
through the code later. Do you know where the metric system is configured
for things like network memory? Just for reference.
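
In case it is useful for the ticket, here is a rough sketch of the check
described in points 1-3 of the quoted message below. It is only an
illustration under assumptions: FrameSizeMonitor and all of its methods are
hypothetical names, not existing Flink or Pekko classes, and the sketch does
not show how the serialized size would actually be obtained in
PekkoInvocationHandler or how the values would be registered as metrics. The
80% threshold is just the example value from the proposal.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Flink or Pekko): tracks serialized RPC
 * message sizes relative to a configured maximum frame size.
 */
final class FrameSizeMonitor {
    private final long maxFrameSizeBytes;
    private final double warnRatio;
    private final AtomicLong lastSizeBytes = new AtomicLong();
    private final AtomicLong peakSizeBytes = new AtomicLong();

    FrameSizeMonitor(long maxFrameSizeBytes, double warnRatio) {
        if (maxFrameSizeBytes <= 0) {
            throw new IllegalArgumentException("maxFrameSizeBytes must be positive");
        }
        if (warnRatio <= 0.0 || warnRatio > 1.0) {
            throw new IllegalArgumentException("warnRatio must be in (0, 1]");
        }
        this.maxFrameSizeBytes = maxFrameSizeBytes;
        this.warnRatio = warnRatio;
    }

    /**
     * Records the size of one outgoing message. Returns a warning text when
     * the size reaches the warning threshold, otherwise an empty Optional.
     */
    Optional<String> record(long serializedSizeBytes) {
        lastSizeBytes.set(serializedSizeBytes);
        peakSizeBytes.accumulateAndGet(serializedSizeBytes, Math::max);
        double ratio = (double) serializedSizeBytes / maxFrameSizeBytes;
        if (ratio >= warnRatio) {
            return Optional.of(String.format(
                    Locale.ROOT,
                    "RPC message size is %.0f%% of maximum frame size (%d/%d bytes). "
                            + "Consider increasing pekko.framesize if you see timeouts.",
                    ratio * 100.0, serializedSizeBytes, maxFrameSizeBytes));
        }
        return Optional.empty();
    }

    /** Size of the most recently recorded message, in bytes. */
    long currentSizeBytes() {
        return lastSizeBytes.get();
    }

    /** Largest message size recorded so far, in bytes. */
    long peakSizeBytes() {
        return peakSizeBytes.get();
    }

    /** Most recent message size as a percentage of the maximum frame size. */
    double currentPercentOfMax() {
        return 100.0 * lastSizeBytes.get() / maxFrameSizeBytes;
    }
}
```

The three getters correspond to the three values in point 1 and could back
gauges, and the Optional returned by record() would be what gets logged as a
warning before the message is sent.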

Regards,
Alexis.

On Wed, 21 Jan 2026, 14:10 Gabor Somogyi, <[email protected]> wrote:

> I think your idea of adding monitoring similar to network memory warnings
> makes sense. Here's what I can roughly imagine without super deep
> consideration:
>
> 1. Metrics for RPC message sizes - Track the serialized size of outgoing
> RPC messages and expose metrics showing:
>     - Current message size
>     - Percentage of max frame size used
>     - Peak message size
> 2. Warning logs - Similar to network memory warnings, log something like:
> "RPC message size is X% of maximum frame size (Y/Z bytes). Consider
> increasing pekko.framesize if you see timeouts."
> 3. Proactive detection - Before sending large messages, check if they
> approach the frame size limit (e.g., >80%) and log warnings.
>
> The monitoring points would be in PekkoInvocationHandler.java [1] (RPC
> invocation handling) and PekkoRpcActor.java [2] (RPC message receiving).
>
> BR,
> G
>
> [1]
> https://github.com/apache/flink/blob/master/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoInvocationHandler.java
> [2]
> https://github.com/apache/flink/blob/master/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoRpcActor.java
>
> On Wed, Jan 21, 2026 at 1:12 PM Alexis Sarda-Espinosa <
> [email protected]> wrote:
>
>> Hi Gabor,
>>
>> I currently can't offer much advice there because I don't know which
>> components from the Flink framework depend on pekko's frame size.
>>
>> I know there are warnings logged for network memory and network buffers,
>> something like "110% of network memory requested, max value X, consider
>> increasing it." Maybe it's possible to have something like that for RPC
>> frame size? Ideally exposed with a metric that shows when requests exceed a
>> certain percentage of the max value.
>>
>> Regards,
>> Alexis.
>>
>> On Wed, 21 Jan 2026, 12:42 Gabor Somogyi, <[email protected]>
>> wrote:
>>
>>> Hi Alexis,
>>>
>>> I'm not aware of such a feature. Just for my own understanding, how
>>> would you imagine such a feature working?
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Wed, Jan 21, 2026 at 11:20 AM Alexis Sarda-Espinosa <
>>> [email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> If I recall correctly, pekko's frame size (and also akka's in the past)
>>>> was always an issue. I think the documentation said that sometimes the
>>>> application just needs a larger size and it's not possible to know in
>>>> advance when that can happen. Today we saw a job restart and subsequently
>>>> crashloop with this exception cause:
>>>>
>>>> Caused by: java.util.concurrent.TimeoutException: Invocation of
>>>> [RemoteRpcInvocation(TaskExecutorGateway.submitTask(TaskDeploymentDescriptor,
>>>> JobMasterId, Duration))] at recipient [pekko.tcp://
>>>> [email protected]:6122/user/rpc/taskmanager_0] timed out. This is
>>>> usually caused by: 1) Pekko failed sending the message silently, due to
>>>> problems like oversized payload or serialization failures. In that case,
>>>> you should find detailed error information in the logs. 2) The recipient
>>>> needs more time for responding, due to problems like slow machines or
>>>> network jitters. In that case, you can try to increase pekko.ask.timeout.
>>>>
>>>> To fix this, I increased both pekko.ask.timeout & pekko.framesize
>>>> simultaneously, so I'm not sure which one was the root cause, but in any
>>>> case, is there still no way to monitor if this limit could be reached
>>>> before it happens?
>>>>
>>>> This was with Flink 2.1.1.
>>>>
>>>> Regards,
>>>> Alexis.
>>>>
>>>
