[jira] [Created] (FLINK-34015) execution.savepoint.ignore-unclaimed-state is invalid when passing this parameter by dynamic properties
Renxiang Zhou created FLINK-34015:
-------------------------------------

Summary: execution.savepoint.ignore-unclaimed-state is invalid when passing this parameter by dynamic properties
Key: FLINK-34015
URL: https://issues.apache.org/jira/browse/FLINK-34015
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Affects Versions: 1.17.0
Reporter: Renxiang Zhou
Attachments: image-2024-01-08-14-22-09-758.png, image-2024-01-08-14-24-30-665.png, image-2024-01-08-14-29-04-347.png

We set `execution.savepoint.ignore-unclaimed-state` to true with the -D option when submitting the job, but the JobManager log shows that the value is still false.

Pic 1: we set `execution.savepoint.ignore-unclaimed-state` to true when submitting the job.

!image-2024-01-08-14-22-09-758.png|width=1012,height=222!

Pic 2: the value is still false in the JobManager log.

!image-2024-01-08-14-24-30-665.png|width=651,height=51!

Besides, the parameter `execution.savepoint-restore-mode` has the same problem when we pass it with the -D option.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32881) Client supports making savepoints in detach mode
Renxiang Zhou created FLINK-32881:
-------------------------------------

Summary: Client supports making savepoints in detach mode
Key: FLINK-32881
URL: https://issues.apache.org/jira/browse/FLINK-32881
Project: Flink
Issue Type: Improvement
Components: API / State Processor, Client / Job Submission
Affects Versions: 1.19.0
Reporter: Renxiang Zhou
Fix For: 1.19.0
Attachments: image-2023-08-16-17-14-34-740.png, image-2023-08-16-17-14-44-212.png

When a savepoint is triggered with the command-line tool, the client has to wait for the job to finish creating the savepoint before it can exit. For jobs with large state, savepoint creation can be time-consuming, which leads to the following problems:
# Platform users may need to manage thousands of Flink tasks on a single client machine. With the current savepoint triggering mode, every savepoint creation thread on that machine has to wait for its job to finish the snapshot, which wastes significant resources.
# If creating the savepoint takes longer than the client's timeout, the client throws a timeout exception and reports that triggering the savepoint failed. Since different jobs have varying savepoint durations, it is difficult to tune the client's timeout parameter.

Therefore, we propose adding a detach mode for triggering savepoints on the client side, similar to the detach mode used when submitting jobs. Here are some specific details:
# The savepoint UUID will be generated on the client side. After successfully triggering the savepoint, the client immediately returns the UUID.
# Add a "dump-pending-savepoints" API that allows the client to check whether the triggered savepoint has been created successfully.

With these changes, the client can detach from the savepoint creation process, which reduces resource waste, and it still has a way to check the status of savepoint creation.

!image-2023-08-16-17-14-34-740.png!
!image-2023-08-16-17-14-44-212.png!
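
The two proposed details (a client-generated UUID that is returned immediately, and a later status query) can be sketched roughly as follows. This is only an illustration using the plain JDK, not Flink's client API: the class `DetachedSavepointSketch`, the `Status` enum, and the methods `triggerDetached` and `status` are hypothetical names, and the `CompletableFuture` merely stands in for the job's asynchronous snapshot work.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a detached savepoint trigger; not Flink's client API. */
public class DetachedSavepointSketch {

    /** Hypothetical status values for a triggered savepoint. */
    public enum Status { PENDING, COMPLETED, FAILED }

    // Triggered savepoints, keyed by the client-generated UUID.
    private final Map<String, Status> savepoints = new ConcurrentHashMap<>();

    /**
     * Registers the savepoint under a client-generated id and returns at once.
     * The future stands in for the job's asynchronous snapshot work.
     */
    public String triggerDetached(CompletableFuture<Void> snapshotWork) {
        String id = UUID.randomUUID().toString();
        savepoints.put(id, Status.PENDING);
        // Record the outcome whenever the snapshot finishes; the caller does not wait.
        snapshotWork.whenComplete((ignored, error) ->
                savepoints.put(id, error == null ? Status.COMPLETED : Status.FAILED));
        return id;
    }

    /** Stand-in for the proposed "dump-pending-savepoints" query; null if the id is unknown. */
    public Status status(String id) {
        return savepoints.get(id);
    }

    public static void main(String[] args) {
        DetachedSavepointSketch client = new DetachedSavepointSketch();
        CompletableFuture<Void> work = new CompletableFuture<>();
        String id = client.triggerDetached(work);
        System.out.println(id + " -> " + client.status(id)); // PENDING: client already returned
        work.complete(null); // the job finishes the snapshot later
        System.out.println(id + " -> " + client.status(id)); // COMPLETED
    }
}
```

In this sketch the client thread is free as soon as `triggerDetached` returns, so a slow snapshot cannot hit a client-side timeout; whether the real implementation would track state on the client or on the JobManager is left open by the issue.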
[jira] [Created] (FLINK-31249) Checkpoint Timer failed to process timeout events when it blocked at writing _metadata to DFS
renxiang zhou created FLINK-31249:
-------------------------------------

Summary: Checkpoint Timer failed to process timeout events when it blocked at writing _metadata to DFS
Key: FLINK-31249
URL: https://issues.apache.org/jira/browse/FLINK-31249
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 1.16.0, 1.11.6
Reporter: renxiang zhou
Fix For: 1.18.0
Attachments: image-2023-02-28-11-25-03-637.png

The jobmanager-future thread may block while writing metadata to DFS because of a DFS failure, and it holds the CheckpointCoordinator lock while it is blocked. When the next checkpoint is triggered, the Checkpoint Timer thread waits for the lock to be released. If the previous checkpoint times out in the meantime, the checkpoint timer cannot execute the timeout event because it is blocked waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
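
The contention reported in FLINK-31249 can be reproduced in miniature with the plain JDK. The sketch below is an illustration under stated assumptions, not Flink code: `CheckpointLockSketch` and `timerCanRunTimeout` are hypothetical names, a `CountDownLatch` simulates the stalled DFS write, and the "timer" uses a bounded `tryLock` only so that the example terminates. An unbounded lock acquisition would wait for as long as the write stays stalled, which is the situation the issue describes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of the lock contention in the report; all names are hypothetical, not Flink code. */
public class CheckpointLockSketch {

    /**
     * Returns true if the "timer" could take the coordinator lock within waitMillis
     * while another thread holds it during a stalled metadata write.
     */
    public static boolean timerCanRunTimeout(long waitMillis) {
        ReentrantLock coordinatorLock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch dfsRecovered = new CountDownLatch(1);

        // Stand-in for the jobmanager-future thread: takes the lock, then blocks on I/O.
        Thread writer = new Thread(() -> {
            coordinatorLock.lock();
            try {
                lockHeld.countDown();
                dfsRecovered.await(); // simulated write to a DFS that does not respond
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                coordinatorLock.unlock();
            }
        }, "jobmanager-future");
        writer.start();

        try {
            lockHeld.await(); // make sure the writer owns the lock first
            // Stand-in for the checkpoint timer: its timeout handler needs the same lock.
            boolean acquired = coordinatorLock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                coordinatorLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            dfsRecovered.countDown(); // release the writer so the sketch terminates
        }
    }

    public static void main(String[] args) {
        // Prints "timeout handler ran: false" because the lock is held for the whole wait.
        System.out.println("timeout handler ran: " + timerCanRunTimeout(200));
    }
}
```

The sketch suggests why the timeout event is lost: the handler itself is cheap, but it cannot start until the thread doing blocking I/O releases the shared lock.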