[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-06 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696770#comment-17696770
 ] 

Roman Khachatryan commented on FLINK-31249:
---

That's doable by writing metadata in a separate (IO) thread and waiting for a 
result with a timeout.

But I'm not sure whether that wouldn't do more harm than good:
 * most of the work was already done by this point (snapshotting the tasks), 
and timing out writing the metadata file (usually small) will discard and start 
it over; that essentially delays the checkpoint
 * and if the timeout is caused by the overload then that next checkpoint is 
much less likely to succeed (because it needs to discard the state written, 
upload it again, write metadata again)
 * in a more narrow case, when it's the IO thread pool that is overloaded (but 
not the IO) - it will be a pure regression

 So I'd avoid such a change without a real world use case.

 

Could you elaborate why the above proposal
{quote}Rather, specific FS implementations can be configured to tinder out too 
long requests.
{quote}
doesn't work in your case?

 

As for the alerts, it should also possible to have them when there are no 
datapoints about recent checkpoints.

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: Renxiang Zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-05 Thread Renxiang Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696741#comment-17696741
 ] 

Renxiang Zhou commented on FLINK-31249:
---

Thanks for your reply.

The following checkpoints are all blocked when this case occurred, and the 
checkpoint-related metrics can not report, so the user may not realize that his 
job's checkpoint has blocked for a long time. This is important for tasks with 
high real-time requirements. So could we think about failing this 
checkpoint?User can aware that the checkpoint is stuck by the failure 
checkpoints. 

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: Renxiang Zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-03 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696230#comment-17696230
 ] 

Roman Khachatryan commented on FLINK-31249:
---

Allowing to trigger a new checkpoint without unblocking the other (main) thread 
doesn't make much sense to me: at least to process the ACKs for that new 
checkpoint, the main thread is required.

 

Ideally, all IO should be done in a separate thread, but we're not there yet. I 
don't see a way to interrupt writing metadata generically (for any filesystem).

Rather, specific FS implementations can be configured to tinder out too long 
requests.

 

Besides that, the same filesystem usually stores state backend snapshots and 
this metadata. When overloaded, it's more likely that state backend snapshots 
will time out first.

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: Renxiang Zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-03 Thread Renxiang Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696116#comment-17696116
 ] 

Renxiang Zhou commented on FLINK-31249:
---

[~roman] If it takes too long to finalize the checkpoint metadata, it usually 
means that there is a problem with the external storage service (in HDFS, it 
could happen when writing to a slow DataNode). In this case, we can retry 
writing a new metadata to DFS or just discard this checkpoint and make another 
one, rather than leaving the checkpoint stuck. What do you think of it ?

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: Renxiang Zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-03 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696102#comment-17696102
 ] 

Roman Khachatryan commented on FLINK-31249:
---

[~zhourenxiang] , the timeout is applied to a checkpoint that is already 
started (i.e. RPC sent out to sources). Long checkpoint might potentially 
accumulate too much data (with Unaligned checkpoints) or block progress (with 
Aligned checkpoints).

But here, the checkpoint was not yet triggered, so the tasks aren't even aware 
of it. What would be the benefit of timing it out?

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: Renxiang Zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-02 Thread renxiang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695993#comment-17695993
 ] 

renxiang zhou commented on FLINK-31249:
---

[~roman] 

When it takes too long to finalize the last checkpoint, should we cancel the 
last checkpoint by checkpoint timeout function?

Currently I haven't observed this issue in non-mocked setup, but I think it 
could happen when finalizing checkpoint gets stuck in writing metadata to DFS 
due to a DFS failure, like namenode failure of HDFS.

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: renxiang zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-02 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695918#comment-17695918
 ] 

Roman Khachatryan commented on FLINK-31249:
---

[~mayuehappy] , [~zhourenxiang] ,

on the images I see that CheckpointCoordinator.chooseRequestToExecute is 
waiting for the last checkpoint to be finalized. This is intentional to avoid 
concurrency issues.

IIUC, checkpoint finalization is paused artificially.

 

Are you observing any issues with that in non-mocked setup?

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: renxiang zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-03-02 Thread renxiang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695686#comment-17695686
 ] 

renxiang zhou commented on FLINK-31249:
---

[~Yanfei Lei] [~masteryhx] Count you please have a look at this ticket ? :P

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: renxiang zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-02-28 Thread Yue Ma (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694555#comment-17694555
 ] 

Yue Ma commented on FLINK-31249:


[~roman] [~yunta] could you please take a look at this ticket ? 



> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: renxiang zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When jobmanager receives all ACKs of tasks, it will finalize the pending 
> checkpoint to a completed checkpoint. Currently JM finalizes the pending 
> checkpoint with holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked at 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> And then the next checkpoint is triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released. 
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event since it is blocked at waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)