[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696770#comment-17696770 ]

Roman Khachatryan commented on FLINK-31249:
---

That's doable by writing the metadata in a separate (IO) thread and waiting for the result with a timeout. But I'm not sure that it wouldn't do more harm than good:
* most of the work (snapshotting the tasks) is already done by this point, and timing out the write of the metadata file (usually small) will discard the checkpoint and start it over; that essentially delays the checkpoint
* if the timeout is caused by overload, the next checkpoint is much less likely to succeed (because it needs to discard the state written, upload it again, and write the metadata again)
* in the narrower case where it is the IO thread pool that is overloaded (but not the IO itself), it will be a pure regression

So I'd avoid such a change without a real-world use case.

Could you elaborate on why the above proposal
{quote}Rather, specific FS implementations can be configured to time out too long requests.{quote}
doesn't work in your case?

As for the alerts, it should also be possible to have them when there are no datapoints about recent checkpoints.

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: Renxiang Zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending checkpoint to a completed checkpoint. Currently JM finalizes the pending checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not execute the timeout event since it is blocked at waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
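For illustration only, the approach mentioned in this comment (write the metadata on an IO thread and wait for the result with a timeout) could look roughly like the sketch below. It uses only `java.util.concurrent`; the class `MetadataWriteWithTimeout` and the method `writeWithTimeout` are hypothetical names and are not part of Flink, and the `Callable` merely stands in for the real metadata write. It does not address the drawbacks listed in the comment.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not Flink code. */
final class MetadataWriteWithTimeout {

    /**
     * Submits the (possibly blocking) write to the IO executor and waits at most
     * timeoutMillis for it. Returns true if the write finished in time, false otherwise.
     */
    static boolean writeWithTimeout(
            ExecutorService ioExecutor, Callable<Void> write, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        Future<Void> result = ioExecutor.submit(write);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort only: an interrupt does not reliably abort a blocking
            // filesystem call, so the IO thread may stay busy after this returns.
            result.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            // A write that returns immediately.
            boolean fast = writeWithTimeout(io, () -> null, 1_000);
            // A write that takes far longer than the allowed 100 ms.
            boolean slow = writeWithTimeout(io, () -> {
                Thread.sleep(5_000);
                return null;
            }, 100);
            System.out.println("fast write completed in time: " + fast);
            System.out.println("slow write completed in time: " + slow);
        } finally {
            io.shutdownNow();
        }
    }
}
```

Note that the caller still has to decide what to do after a timeout (discard the checkpoint, retry, or fail the job), which is exactly the trade-off discussed above.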
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696741#comment-17696741 ]

Renxiang Zhou commented on FLINK-31249:
---

Thanks for your reply. When this case occurred, all following checkpoints were blocked and the checkpoint-related metrics could not be reported, so the user may not realize that the job's checkpoints have been blocked for a long time. This matters for jobs with high real-time requirements. So could we consider failing this checkpoint? The user could then become aware that checkpointing is stuck from the failed checkpoints.
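One way to read the alerting suggestion made elsewhere in this thread (alert when there are no datapoints about recent checkpoints) is a staleness check on the time of the last completed checkpoint. The sketch below is an assumption about how such a check might be written in a monitoring job; `StaleCheckpointAlert`, `shouldAlert`, and the `maxAge` threshold are hypothetical and do not refer to any Flink or monitoring-system API.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical staleness check for illustration; not a Flink API. */
final class StaleCheckpointAlert {

    /**
     * Returns true if an alert should fire: either no completed checkpoint has been
     * seen at all (lastCompleted == null), or the last one is older than maxAge.
     */
    static boolean shouldAlert(Instant lastCompleted, Instant now, Duration maxAge) {
        if (lastCompleted == null) {
            return true; // no datapoints at all
        }
        return Duration.between(lastCompleted, now).compareTo(maxAge) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration maxAge = Duration.ofMinutes(10);
        System.out.println(shouldAlert(now.minus(Duration.ofMinutes(2)), now, maxAge));
        System.out.println(shouldAlert(now.minus(Duration.ofMinutes(30)), now, maxAge));
        System.out.println(shouldAlert(null, now, maxAge));
    }
}
```

Such a check fires even when the checkpoint coordinator is stuck and emits no failure events, which is the situation described in the comment above.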
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696230#comment-17696230 ]

Roman Khachatryan commented on FLINK-31249:
---

Allowing a new checkpoint to be triggered without unblocking the other (main) thread doesn't make much sense to me: the main thread is required at least to process the ACKs for that new checkpoint. Ideally, all IO should be done in a separate thread, but we're not there yet.

I don't see a way to interrupt writing metadata generically (for any filesystem). Rather, specific FS implementations can be configured to time out too long requests.

Besides that, the same filesystem usually stores both the state backend snapshots and this metadata. When it is overloaded, it's more likely that the state backend snapshots will time out first.
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696116#comment-17696116 ]

Renxiang Zhou commented on FLINK-31249:
---

[~roman] If it takes too long to finalize the checkpoint metadata, it usually means that there is a problem with the external storage service (in HDFS, this can happen when writing to a slow DataNode). In this case, we could retry writing new metadata to the DFS, or just discard this checkpoint and take another one, rather than leaving the checkpoint stuck. What do you think?
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696102#comment-17696102 ]

Roman Khachatryan commented on FLINK-31249:
---

[~zhourenxiang], the timeout is applied to a checkpoint that has already started (i.e. the RPC has been sent out to the sources). A long checkpoint might accumulate too much data (with unaligned checkpoints) or block progress (with aligned checkpoints). But here, the checkpoint has not been triggered yet, so the tasks aren't even aware of it. What would be the benefit of timing it out?
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695993#comment-17695993 ]

renxiang zhou commented on FLINK-31249:
---

[~roman] When it takes too long to finalize the last checkpoint, should we cancel it via the checkpoint timeout mechanism? So far I haven't observed this issue in a non-mocked setup, but I think it could happen when finalizing a checkpoint gets stuck writing metadata to the DFS due to a DFS failure, such as an HDFS NameNode failure.
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695918#comment-17695918 ]

Roman Khachatryan commented on FLINK-31249:
---

[~mayuehappy], [~zhourenxiang], in the images I see that CheckpointCoordinator.chooseRequestToExecute is waiting for the last checkpoint to be finalized. This is intentional, to avoid concurrency issues. IIUC, checkpoint finalization is paused artificially. Are you observing any issues with that in a non-mocked setup?
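The blocking pattern reported in this issue (one thread holds a lock while finalization is stuck, so a second thread that needs the same lock cannot run its timeout handling) can be reproduced in isolation with plain Java monitors. This is a minimal sketch, not Flink code: `CoordinatorLockDemo`, the thread names, and the latch that stands in for a stuck metadata write are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-alone demo of the lock contention; not Flink code. */
final class CoordinatorLockDemo {

    /**
     * Returns true if the "timer" thread managed to run its timeout handling within
     * waitMillis while the "finalizer" thread was still holding the shared lock.
     */
    static boolean timerRunsWhileFinalizing(long waitMillis) throws InterruptedException {
        final Object lock = new Object();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch releaseFinalizer = new CountDownLatch(1);
        CountDownLatch timeoutHandled = new CountDownLatch(1);

        Thread finalizer = new Thread(() -> {
            synchronized (lock) {
                lockHeld.countDown();
                try {
                    // Stands in for a metadata write that is stuck on a failing DFS.
                    releaseFinalizer.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "finalizer");

        Thread timer = new Thread(() -> {
            // The timeout handling needs the same lock, so it waits here.
            synchronized (lock) {
                timeoutHandled.countDown();
            }
        }, "timer");

        finalizer.start();
        lockHeld.await(); // make sure the finalizer owns the lock first
        timer.start();
        boolean ran = timeoutHandled.await(waitMillis, TimeUnit.MILLISECONDS);

        // Unblock the finalizer so that both threads can terminate.
        releaseFinalizer.countDown();
        finalizer.join();
        timer.join();
        return ran;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timeout handled while finalizing: " + timerRunsWhileFinalizing(200));
    }
}
```

In this demo the timeout handling only runs after the finalizer releases the lock, which mirrors the behaviour shown in the attached thread dumps.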
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695686#comment-17695686 ]

renxiang zhou commented on FLINK-31249:
---

[~Yanfei Lei] [~masteryhx] Could you please have a look at this ticket? :P
[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694555#comment-17694555 ]

Yue Ma commented on FLINK-31249:
---

[~roman] [~yunta] Could you please take a look at this ticket?