[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191391#comment-17191391 ] Zhu Zhu commented on FLINK-18641: - Thanks for the updates! [~becket_qin] > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Assignee: Jiangjie Qin >Priority: Blocker > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191389#comment-17191389 ] Jiangjie Qin commented on FLINK-18641: -- [~zhuzh] Just some minor items left, should be checked in soon. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Assignee: Jiangjie Qin >Priority: Blocker > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190612#comment-17190612 ] Zhu Zhu commented on FLINK-18641: - Hi [~becket_qin], that's the status of the pull request? > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Assignee: Jiangjie Qin >Priority: Blocker > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170530#comment-17170530 ] Jiangjie Qin commented on FLINK-18641: -- [~pnowojski] [~SleePy] Somehow the PR info was not updated in the ticket. Would you help take a look? [https://github.com/apache/flink/pull/13044] > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Assignee: Jiangjie Qin >Priority: Major > Labels: pull-request-available > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166925#comment-17166925 ] Jiangjie Qin commented on FLINK-18641: -- [~SleePy] Sorry for the late reply. I have a fix ready and is working on the tests. Will submit a patch this week. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166057#comment-17166057 ] Biao Liu commented on FLINK-18641: -- [~becket_qin], [~pnowojski], the ticket has not been assigned yet. Is there anyone working on this? > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164178#comment-17164178 ] Jiangjie Qin commented on FLINK-18641: -- [~pnowojski] Synchronously waiting for the hook's future is not sufficient after introduction of OperatorCoordinator. This is because when the master hooks are triggered on the {{ExternallyInducedSource}}, the tasks may start to checkpoint the operators before the OperatorCoordinator is checkpointed. This breaks the checkpoint contract between the OperatorCoordinator and the Operator, which is that the OperatorCoordinator is always checkpointed before the subtasks/operators are checkpointed. Otherwise, yes, we only need to wait for the master hooks to finish before completing the checkpoint. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163425#comment-17163425 ] Piotr Nowojski commented on FLINK-18641: [~becket_qin] yes you are right that the problem was introduced in FLINK-13905 in 1.11.0. {quote} By design, the checkpoint should always actually take the snapshot of the master hooks and OperatorCoordinator first before taking the checkpoint on the tasks. {quote} Regarding the ordering guarantees. I don't understand where does this strict ordering comes from? Java docs of the {{MasterTriggerRestoreHook}} doesn't seem to support this statement: {code} * If the action should be executed asynchronously and only needs to complete before the * checkpoint is considered completed, then the method may use the given executor to execute the * actual action and would signal its completion by completing the future. {code} So synchronously waiting for the hook's future to complete would solve the problem, but what I proposed above: {quote} And the solution would be to include waiting for async hook to complete, before completing/finalising the checkpoint? {quote} Should also be correct, right? As a sidenote, I'm +1 for the easiest solution to solve this bug without causing regressions compared to 1.10/1.9 and without undermining {{CheckpointCoordinator}} threading model refactor (which we still need to complete). As you both mentioned, {{MasterTriggerRestoreHook}} is on it's way out to be replaced by FLIP-27. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163359#comment-17163359 ] Jiangjie Qin commented on FLINK-18641: -- [~SleePy] Right, the master hooks with ExternalInducedSource are not really that robust in my opinion. I think the fix is just a few lines of code. However, apparently we do not have a test covering the use case of {{ExternallyInducedSource}}. So as usual, I think the major work here will be writing the tests. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163341#comment-17163341 ] Biao Liu commented on FLINK-18641: -- Ah, sorry [~pnowojski], I forgot this change is not submitted to release-1.10 at that time, it's only submitted to master. [~becket_qin], thanks for correcting it. There are two different things of your response, master hook and {{OperatorCoordinator}}. Regarding to this ticket, it's not caused by the {{OperatorCoordinator}} part, right? I think we could file another ticket for it and discuss it there. {quote}By design, the checkpoint should always actually take the snapshot of the master hooks and OperatorCoordinator first before taking the checkpoint on the tasks.{quote} Technically speaking, there is no clear semantics that master hook should be taken before task snapshotting. For the {{ExternallyInducedSource}}, the task snapshotting might be taken before master hook finishes the future returned. And if there are multiple master hooks, some hooks might be invoked after task snapshotting. It's concurrent somewhat. I don't think we should/could guarantee the ordering here. Anyway we have to fix the issue of {{ExternallyInducedSource}} caused by the ordering. Regarding to the fixing plan. I'm not sure how heavy the fixing of {{OperatorCoordinator}} might be. If it's not a simple fixing, it might be better to separate these things into different patches. As a quick fixing of this ticket, we could take the master hook synchronously with the coordinator-wide lock retaining just like before. {quote}I agree that In long run, the operator coordinator can actually supersede the master hooks. So we can probably mark the master hooks as deprecated.{quote} Totally agree! > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163267#comment-17163267 ] Jiangjie Qin commented on FLINK-18641: -- [~pnowojski] [~SleePy] There are actually more problems than the issues reported by this ticket. In checkpoint there are two orders: # the checkpoint taking order; # the checkpoint acknowledge handling order. This ticket reports the problem of unexpected checkpoint acknowledge handling order. While the acknowledge handling order is not super important, the checkpoint taking order is critical because that may cause correctness problem. By design, the checkpoint should always actually take the snapshot of the master hooks and OperatorCoordinator first before taking the checkpoint on the tasks. Prior to FLIP-27, this ordering is always guaranteed because the checkpoint on the tasks are triggered in one of the following ways: # The CheckpointCoordinator takes snapshot of the master hooks, wait for it to finish, and then trigger the snapshot on tasks. (Strict snapshot order is guaranteed by the checkpoint coordinator.) # The CheckpointCoordinator takes snapshot of the master hooks. The master hook triggers the checkpoint on the tasks using ExternallyInducedSource. The CheckpointCoordinator waits until the master hooks finishes snapshot. (In this case, the master hooks have to guarantee the consistency between the master state and the task states. The tasks may ack the checkpoint before the master hooks completes, but the acks won't be handled until the master hooks acks are handled). After FLIP-27, with the introduction of OperatorCoordinator, we need to make sure that the tasks checkpoint cannot be triggered before the OperatorCoordinator checkpoint. Given that the task checkpoint can be triggered by both the CheckpointCoordinator itself or the master hooks. We should first checkpoint the OperatorCoordinator, then the master hooks, lastly triggering the task checkpoint. The current code does not do this correctly. That means the master hooks can trigger the snapshot of the tasks before the OperatorCoordinators are checkpointed. Regarding the fix, I am thinking of the following: # Change the checkpoint order to checkpoint OperatorCoordinators first, i.e before checkpointing master hooks. # Keep the master hooks async, but invoke them after checkpointing the OperatorCoordinators. This is fine because according to the java doc of {{MasterTriggerRestoreHook#triggerCheckpoint()}} can choose to be synchronous if it wants to. Otherwise the CheckpointCoordinator will consider it as asynchronous friendly. # Change the logic to only call {{CheckpointCoordinator#completePendingCheckpoint()}} after all the checkpoints are acked, instead of only looking the task checkpoints. This is because the acknowledge handling order may vary given the current async checkpoint pattern. I agree that In long run, the operator coordinator can actually supersede the master hooks. So we can probably mark the master hooks as deprecated. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at >
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163239#comment-17163239 ] Jiangjie Qin commented on FLINK-18641: -- [~pnowojski] In Flink 1.10, the checkpoint thread blocks on the future returned by the master hook, therefore that thread will not handle the acks from the tasks until the master hook has completed. [https://github.com/apache/flink/blob/bfe6c2eddedaf3b8067c973ba82a06b36d5095f9/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L627] Now in Flink 1.11, the checkpoint thread no longer blocks on that future anymore, it uses handleAsync instead. [https://github.com/apache/flink/blob/791e276c8346a49130cb096bafa128d7f1231236/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L708] [https://github.com/apache/flink/blob/791e276c8346a49130cb096bafa128d7f1231236/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L544] So that is what I meant by {quote}In Flink 1.10, the checkpoint thread blocks on the future returned by the master hook. This becomes asynchronous in Flink 1.11. {quote} > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162667#comment-17162667 ] Piotr Nowojski commented on FLINK-18641: So the problem is that checkpoint is finalised after JM receives all of the ACKS, without waiting for async master hook to complete? And the solution would be to include waiting for async hook to complete, before completing/finalising the checkpoint? {quote} In Flink 1.10, the checkpoint thread blocks on the future returned by the master hook. This becomes asynchronous in Flink 1.11. {quote} [~becket_qin], as [~SleePy] mentioned that's not correct. Those changes were done already in Flink 1.10. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162553#comment-17162553 ] Biao Liu commented on FLINK-18641: -- Thanks [~becket_qin] for analyzing this issue. The asynchronous checkpoint threading model breaks the assumption of {{ExternallyInducedSource}} could trigger a checkpoint before {{MasterTriggerRestoreHook}} finishes the trigger future. We can find a way to guarantee that. However it's not friendly to the scenario that doing some initialization or preparation in {{MasterTriggerRestoreHook}}. Because checkpoint might be triggered before initialization of preparation finishes. Hope nobody uses it like that :( Currently the semantics of {{ExternallyInducedSource}} is highly bound with the implementation of Flink checkpoint which should be avoided IMO. I think we should redesign the {{ExternallyInducedSource}} as a long-term goal. To [~becket_qin], do you already have any idea for fixing it? If not, I could help to fix it. BTW, this change of {{CheckpointCoordinator}} is introduced in 1.10. Is it possible that the failure of testing case is exposed by the change of {{OperatorCoordinator}}? Because we add another asynchronous step between master hook triggering and task triggering. I'm not sure if there must be some {{OperatorCoordinator}} added or not in the scenario of Pravega connector testing. If not, there is a work-around way that try to finish future returned by {{MasterTriggerRestoreHook.triggerCheckpoint}} before trigger task checkpoint (I assume there is only one master hook in the case). > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162428#comment-17162428 ] Brian Zhou commented on FLINK-18641: Thanks Jiangjie for looking into this. I noticed the vote for 1.11.1 release candidate has passed 2 days ago, so does that mean this one could not get into the 1.11.1 release? This one is now a blocker to Pravega Flink 1.11 connector, so I hope this could be fixed ASAP. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162410#comment-17162410 ] Jiangjie Qin commented on FLINK-18641: -- Looked into this a little bit. The problems is following: # The Pravega connector uses the {{MasterTriggerRestoreHook}} and {{ExternallyInducedSource}} for checkpointing. That means the tasks may start checkpointing after the {{MasterTriggerRestoreHook.triggerCheckpoint()}} is invoked. # In Flink 1.10, the checkpoint thread blocks on the future returned by the master hook. This becomes asynchronous in Flink 1.11. It only guarantees that the JM triggers the task checkpoint after the master state is checkpointed. However, this does not cover the case where the task checkpoints are triggered by the master hooks rather than by JM itself. # Due to the change in (2), what happens in this case is that the task checkpionts are triggered by the master hook, and acknowledged before the master hook future is completed. From JM's perspective, once all the task checkpoints are acked, it will try to finalize the checkpoint, but found that the master checkpoint has not completed yet, thus an exception is thrown. This is a bug in JM when working with {{MasterTriggerRestoreHook}} and {{ExternallyInducedSource.}} I'll submit a fix for this. > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162056#comment-17162056 ] Piotr Nowojski commented on FLINK-18641: [~SleePy] do you have some good guess what's the problem here? > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to make > sure the data recovery. The checkpoint recovery tests are running fine in > Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. > Suspect it is related to the checkpoint coordinator thread model changes in > Flink 1.11 > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)