[ https://issues.apache.org/jira/browse/FLINK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Piotr Nowojski reassigned FLINK-25958:
--------------------------------------

    Assignee: Anton Kalashnikov

> OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI
> ----------------------------------------------------------------
>
>                 Key: FLINK-25958
>                 URL: https://issues.apache.org/jira/browse/FLINK-25958
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0, 1.12.7, 1.13.5, 1.14.3
>         Environment: Ververica Platform 2.6.2
> Flink 1.13.5
>            Reporter: Victor Xu
>            Assignee: Anton Kalashnikov
>            Priority: Major
>         Attachments: JIRA-1.jpg
>
>
> Flink job was running but the checkpoints & savepoints were failing all the
> time due to OOM Exception. However, the Flink UI showed COMPLETE for those
> checkpoints & savepoints.
> For example (checkpoint 39 & 40):
> {noformat}
> 2022-01-27 02:41:39,969 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 39 (type=CHECKPOINT) @ 1643251299952 for job ab2217e5ce144087bbddf6bd6c3668eb.
> 2022-01-27 02:43:19,678 WARN  org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 39. Failure reason: Failure to finalize checkpoint.
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
>     at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
>     at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>     at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     ... 9 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 2022-01-27 03:41:39,970 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 40 (type=CHECKPOINT) @ 1643254899952 for job ab2217e5ce144087bbddf6bd6c3668eb.
> 2022-01-27 03:43:22,326 WARN  org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 40. Failure reason: Failure to finalize checkpoint.
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
>     at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
>     at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>     at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>     ... 9 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
> {noformat}
> Please find attached a screenshot of the Flink UI (both 39 & 40 were
> COMPLETE).
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)