Hi all, I am trying to reduce the memory usage of a Flink app. The state is 25+ GB when persisted to a checkpoint/savepoint, and there is a fair number of short-lived objects because incoming traffic is fairly high. So far I have 8 TaskManagers with 20 GB each, running Flink 1.12. I would like to reduce the amount of memory I allocate, since the state will continue growing. I start my application from an existing savepoint.
Given that CPU is not really an issue, I switched to the RocksDB backend, so that state is serialized and supposedly much more compact in memory. I am setting taskmanager.memory.process.size=20000m and taskmanager.memory.managed.size=6000m (and tried other values ranging from 3000m to 10000m). I explicitly set state.backend.rocksdb.memory.managed=true to be sure.

The issue I observed is that the task manager pod memory increases during each checkpoint, and the 4th checkpoint fails because most of the pods are OOMKilled. There is no Java exception in the logs, so I really suspect it is simply RocksDB using more memory than allocated. I tried checkpoint intervals of 2 minutes and 5 minutes, and it always seems to fail during the 4th checkpoint. I tried incremental checkpoints, and after 30 checkpoints there is no sign of failure so far. I also tried adding a few GB of overhead memory, but that only delays the issue a bit longer.

The heap usage graph looks as expected in all cases. The heap goes back to a few hundred MB after GC, as the only long-lived state is off-heap. Xmx is about 12 GB, but peak heap usage is at most 6 GB.

Am I misconfiguring anything that could explain the OOMKilled pods? Also, what is the best single metric to monitor RocksDB memory usage? (I tried estimate-live-data-size and size-all-mem-tables, but I am not fully sure yet about their exact meaning.)

Best,
Alex
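
P.S. For reference, here is a rough sketch of how I understand the 20000m process size to be split up, in case my mental model is where the mistake is. It assumes the Flink 1.12 defaults for every option I did not set explicitly (JVM metaspace 256m, JVM overhead fraction 0.1 capped to 192m..1g, network fraction 0.1 capped to 64m..1g, framework heap and off-heap 128m each, task off-heap 0). Those defaults are my assumption and not copied from my running config, and the helper names `clamp` and `taskmanager_budget` are made up for this illustration.

```python
# Rough TaskManager memory budget in MiB.
# Assumption: every option not named in this email is at its Flink 1.12 default.

def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(value, high))

def taskmanager_budget(process_mb, managed_mb):
    metaspace = 256            # assumed taskmanager.memory.jvm-metaspace.size
    framework_heap = 128       # assumed taskmanager.memory.framework.heap.size
    framework_off_heap = 128   # assumed taskmanager.memory.framework.off-heap.size
    task_off_heap = 0          # assumed taskmanager.memory.task.off-heap.size

    # JVM overhead is derived as a fraction of the total process size.
    overhead = clamp(process_mb // 10, 192, 1024)
    total_flink = process_mb - metaspace - overhead

    # Network memory is derived as a fraction of total Flink memory.
    network = clamp(total_flink // 10, 64, 1024)

    # Task heap is whatever remains once everything else is carved out.
    task_heap = (total_flink - framework_heap - framework_off_heap
                 - task_off_heap - network - managed_mb)

    return {
        "jvm_overhead": overhead,
        "total_flink": total_flink,
        "network": network,
        "managed": managed_mb,
        "task_heap": task_heap,
        "jvm_heap": framework_heap + task_heap,
    }

if __name__ == "__main__":
    for name, mb in taskmanager_budget(20000, 6000).items():
        print(f"{name:>12}: {mb} MiB")
```

Under these assumptions the JVM heap comes out at 11568 MiB, which is consistent with the roughly 12 GB Xmx I see. It also suggests that, apart from the 6000m of managed memory, only about 1 GB of JVM overhead is left as slack for any native allocation that is not accounted for in the managed budget.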