Hi all,

I am trying to reduce the memory usage of a Flink app.
The state is over 25 GB when persisted to a checkpoint/savepoint, and there
is a fair amount of short-lived objects because incoming traffic is fairly high.
So far, I have 8 TaskManagers with 20 GB each, running Flink 1.12. I would
like to reduce the amount of memory I allocate, as the state will continue
growing. I start my application from an existing savepoint.

Given that CPU is not really an issue, I switched to the RocksDB state
backend, so that state is serialized and supposedly much more compact in memory.
I am setting taskmanager.memory.process.size=20000m and
taskmanager.memory.managed.size=6000m
(and tried other values ranging from 3000m to 10000m).
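
For reference, here is a rough sketch of how I understand the 20000m process
size to be split with these settings. It assumes Flink 1.12 defaults for every
option I did not set (metaspace 256m, JVM overhead and network each 10% of
their base and capped at 1g, framework heap and off-heap 128m each, task
off-heap 0). The function name is made up for illustration and the numbers are
my reading of the memory model, not something I verified against the logs.

```python
def clamp(value, low, high):
    """Restrict value to the range [low, high]."""
    return max(low, min(value, high))


def taskmanager_budget_mb(process_mb, managed_mb):
    """Estimate the TaskManager memory split in MB.

    Assumes Flink 1.12 default values for all options other than
    taskmanager.memory.process.size and taskmanager.memory.managed.size.
    """
    metaspace = 256                                  # jvm-metaspace.size
    overhead = clamp(process_mb // 10, 192, 1024)    # jvm-overhead fraction/min/max
    total_flink = process_mb - metaspace - overhead
    network = clamp(total_flink // 10, 64, 1024)     # network fraction/min/max
    framework_heap = 128
    framework_off_heap = 128
    task_off_heap = 0
    task_heap = (total_flink - managed_mb - network
                 - framework_heap - framework_off_heap - task_off_heap)
    return {
        "jvm_heap": framework_heap + task_heap,      # what ends up as -Xmx
        "managed": managed_mb,                       # RocksDB budget when managed=true
        "network": network,
        "direct_other": framework_off_heap + task_off_heap,
        "metaspace": metaspace,
        "jvm_overhead": overhead,
    }


if __name__ == "__main__":
    for managed in (3000, 6000, 10000):
        print(managed, taskmanager_budget_mb(20000, managed))
```

With managed.size=6000m this gives 11568 MB of JVM heap, which roughly matches
the Xmx I see, and leaves only about 1 GB of JVM overhead for anything native
that is not covered by managed memory.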

The issue I observe is that the TaskManager pod memory increases during
each checkpoint, and the 4th checkpoint fails because most of the pods are
OOMKilled. There is no Java exception in the logs, so I strongly suspect
it is simply RocksDB using more memory than allocated.
I explicitly set state.backend.rocksdb.memory.managed=true to be sure.
I tried intervals of 2 minutes and 5 minutes for the checkpoint, and it
always seems to fail during the 4th checkpoint.

I also tried incremental checkpoints: after 30 checkpoints there is no sign
of failure so far.

I tried with a few GB of overhead memory but that only delays the issue a
bit longer.
The heap usage graph looks as expected in all cases. The heap goes back to
a few hundred MB after GC, as the only long-lived state is off-heap. Xmx is
about 12 GB, but peak heap usage is at most 6 GB.


Am I misconfiguring anything that could explain the OOMKilled pods?

Also, what is the best single metric to monitor RocksDB memory usage? (I
tried estimate-live-data-size and size-all-mem-tables, but I am not yet
fully sure about their exact meaning.)

Best,
Alex
