Hi

We had something similar, and our problem was class loader leaks. We used a summary log component to reduce logging, but it turned out it used a static object that wasn't released when we got an OOM or a restart. Because Flink was reusing the task managers, the only workaround was to stop the job, wait until the task managers were removed, and start it again, until we fixed the underlying problem.
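To make the pattern concrete for anyone hitting the same thing: the class below is only a made-up sketch (not our actual component), but it shows the shape of the leak. Anything user code registers with a class outside the user-code classloader -- here a JVM shutdown hook held by java.lang.Runtime -- keeps a reference to user classes, so when Flink reuses the TaskManager JVM after a failure the old user-code classloader can never be garbage collected and memory grows with every restart. Deregistering in close() is what fixes it.

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    // Hypothetical example of a classloader leak; the real summary-log component
    // was different, but the mechanism was the same: static/external state that
    // was never released on job restart.
    public class SummaryLoggingMap extends RichMapFunction<String, String> {

        private transient Thread flushOnShutdown;

        @Override
        public void open(Configuration parameters) {
            // The hook is held statically by java.lang.Runtime (parent classloader),
            // so it outlives the job and pins every user class it references.
            flushOnShutdown = new Thread(() -> System.out.println("flushing summary log"));
            Runtime.getRuntime().addShutdownHook(flushOnShutdown);
        }

        @Override
        public String map(String value) {
            // Actual summary/aggregation logic omitted; identity map for brevity.
            return value;
        }

        @Override
        public void close() {
            // Releasing the externally held reference when the task shuts down or
            // restarts is what prevents the user-code classloader from leaking.
            if (flushOnShutdown != null) {
                Runtime.getRuntime().removeShutdownHook(flushOnShutdown);
            }
        }
    }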
Best regards
Lasse Nedergaard

> On 3 Feb 2021, at 02:54, Xintong Song <tonysong...@gmail.com> wrote:
>
>> How is the memory measured?
>
> I meant: which Flink or K8s metric is collected? I'm asking because, depending
> on which metric is used, the *container memory usage* can be defined
> differently, e.g. whether mmap memory is included.
>
> Also, could you share the effective memory configuration for the
> taskmanagers? You should find something like the following at the beginning
> of the taskmanager logs.
>
>> INFO [] - Final TaskExecutor Memory configuration:
>> INFO [] -   Total Process Memory:          1.688gb (1811939328 bytes)
>> INFO [] -     Total Flink Memory:          1.250gb (1342177280 bytes)
>> INFO [] -       Total JVM Heap Memory:     512.000mb (536870902 bytes)
>> INFO [] -         Framework:               128.000mb (134217728 bytes)
>> INFO [] -         Task:                    384.000mb (402653174 bytes)
>> INFO [] -       Total Off-heap Memory:     768.000mb (805306378 bytes)
>> INFO [] -         Managed:                 512.000mb (536870920 bytes)
>> INFO [] -         Total JVM Direct Memory: 256.000mb (268435458 bytes)
>> INFO [] -           Framework:             128.000mb (134217728 bytes)
>> INFO [] -           Task:                  0 bytes
>> INFO [] -           Network:               128.000mb (134217730 bytes)
>> INFO [] -     JVM Metaspace:               256.000mb (268435456 bytes)
>> INFO [] -     JVM Overhead:                192.000mb (201326592 bytes)
>
> Thank you~
> Xintong Song
>
>> On Tue, Feb 2, 2021 at 8:59 PM Randal Pitt <randal.p...@foresite.com> wrote:
>> Hi Xintong Song,
>>
>> Correct, we are using standalone k8s. Task managers are deployed as a
>> statefulset, so they have consistent pod names. We tried using native k8s (in
>> fact I'd prefer to) but got persistent
>> "io.fabric8.kubernetes.client.KubernetesClientException: too old resource
>> version: 242214695 (242413759)" errors, which resulted in jobs being
>> restarted every 30-60 minutes.
>>
>> We are using Prometheus Node Exporter to capture memory usage. The graph
>> shows the metric:
>>
>> sum(container_memory_usage_bytes{container_name="taskmanager",pod_name=~"$flink_task_manager"}) by (pod_name)
>>
>> I've attached the original
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2869/Screenshot_2021-02-02_at_11.png>
>> so Nabble doesn't shrink it.
>>
>> Best regards,
>>
>> Randal.
>>
>> --
>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
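PS: for anyone comparing against their own setup, a breakdown like the one Xintong quotes above is driven by the taskmanager.memory.* options in flink-conf.yaml. The values below are only illustrative (they mirror the example log, not the actual settings from this job), and normally you would set just taskmanager.memory.process.size or taskmanager.memory.flink.size and let Flink derive the rest:

    taskmanager.memory.process.size: 1688m            # Total Process Memory
    taskmanager.memory.framework.heap.size: 128m      # JVM Heap -> Framework
    taskmanager.memory.task.heap.size: 384m           # JVM Heap -> Task
    taskmanager.memory.managed.size: 512m             # Off-heap -> Managed
    taskmanager.memory.framework.off-heap.size: 128m  # JVM Direct -> Framework
    taskmanager.memory.task.off-heap.size: 0m         # JVM Direct -> Task
    taskmanager.memory.network.min: 128m              # JVM Direct -> Network
    taskmanager.memory.network.max: 128m
    taskmanager.memory.jvm-metaspace.size: 256m       # JVM Metaspace
    taskmanager.memory.jvm-overhead.min: 192m         # JVM Overhead
    taskmanager.memory.jvm-overhead.max: 192m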