[jira] [Created] (FLINK-34015) execution.savepoint.ignore-unclaimed-state is invalid when passing this parameter by dynamic properties

2024-01-07 Thread Renxiang Zhou (Jira)
Renxiang Zhou created FLINK-34015:
-

 Summary: execution.savepoint.ignore-unclaimed-state is invalid 
when passing this parameter by dynamic properties
 Key: FLINK-34015
 URL: https://issues.apache.org/jira/browse/FLINK-34015
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.17.0
Reporter: Renxiang Zhou
 Attachments: image-2024-01-08-14-22-09-758.png, 
image-2024-01-08-14-24-30-665.png, image-2024-01-08-14-29-04-347.png

We set `execution.savepoint.ignore-unclaimed-state` to true through the -D 
option when submitting the job, but the JobManager log shows that the value is 
still false.

Pic 1: we set `execution.savepoint.ignore-unclaimed-state` to true when 
submitting the job.
!image-2024-01-08-14-22-09-758.png|width=1012,height=222!

Pic 2: the value is still false in the JobManager log.

!image-2024-01-08-14-24-30-665.png|width=651,height=51!

 

Besides, the parameter `execution.savepoint-restore-mode` has the same problem 
when we pass it with the -D option.
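
To make the expected behavior concrete, here is a minimal, self-contained Java 
sketch of how `-Dkey=value` arguments would normally override a default. It is 
not Flink's implementation: the class `DynamicProperties` and its methods are 
hypothetical, and the restore-mode value `CLAIM` is used only as an 
illustration. The sketch makes no claim about the cause of the bug.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink. */
class DynamicProperties {

    /** Collects every "-Dkey=value" argument into a map; later duplicates win. */
    static Map<String, String> parse(String... args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("-D")) {
                continue; // not a dynamic property
            }
            int eq = arg.indexOf('=');
            if (eq <= 2) {
                continue; // no '=' at all, or an empty key such as "-D=value"
            }
            props.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return props;
    }

    /** Returns the dynamic value if one was passed, otherwise the default. */
    static boolean effectiveBoolean(Map<String, String> props, String key, boolean defaultValue) {
        String raw = props.get(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> props = parse(
                "-Dexecution.savepoint.ignore-unclaimed-state=true",
                "-Dexecution.savepoint-restore-mode=CLAIM"); // illustrative value
        // Expected by the reporter: the -D value overrides the default of false.
        System.out.println(
                effectiveBoolean(props, "execution.savepoint.ignore-unclaimed-state", false));
        System.out.println(props.get("execution.savepoint-restore-mode"));
    }
}
```

In the reported case, the JobManager log shows the default (false) even though 
the -D value was supplied at submission.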

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32881) Client supports making savepoints in detach mode

2023-08-16 Thread Renxiang Zhou (Jira)
Renxiang Zhou created FLINK-32881:
-

 Summary: Client supports making savepoints in detach mode
 Key: FLINK-32881
 URL: https://issues.apache.org/jira/browse/FLINK-32881
 Project: Flink
  Issue Type: Improvement
  Components: API / State Processor, Client / Job Submission
Affects Versions: 1.19.0
Reporter: Renxiang Zhou
 Fix For: 1.19.0
 Attachments: image-2023-08-16-17-14-34-740.png, 
image-2023-08-16-17-14-44-212.png

When triggering a savepoint using the command-line tool, the client needs to 
wait for the job to finish creating the savepoint before it can exit. For jobs 
with large state, the savepoint creation process can be time-consuming, leading 
to the following problems:
 # Platform users may need to manage thousands of Flink tasks on a single 
client machine. With the current savepoint triggering mode, all savepoint 
creation threads on that machine have to wait for the job to finish the 
snapshot, resulting in significant resource waste;
 # If savepoint creation takes longer than the client's timeout, the client 
throws a timeout exception and reports that triggering the savepoint failed. 
Since savepoint durations vary from job to job, it is difficult to choose a 
suitable value for the client's timeout parameter.

Therefore, we propose adding a detach mode for triggering savepoints on the 
client side, similar to the detach mode used when submitting jobs. Here are 
some specific details:
 # The savepoint UUID will be generated on the client side. After successfully 
triggering the savepoint, the client immediately returns the UUID information.
 # Add a "dump-pending-savepoints" API interface that allows the client to 
check whether the triggered savepoint has been successfully created.

With these changes, the client can detach from the savepoint creation process, 
which reduces resource waste, and it still has a way to check the status of 
savepoint creation.

!image-2023-08-16-17-14-34-740.png!!image-2023-08-16-17-14-44-212.png!
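
The proposed flow (client-generated UUID, immediate return, later status 
check) can be sketched in plain Java as follows. This is only an in-memory 
model under stated assumptions: `DetachedSavepointSketch`, `triggerDetached`, 
`status` and the `Status` values are hypothetical names, the status lookup 
merely stands in for the proposed "dump-pending-savepoints" interface, and 
none of this is Flink's actual client API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

/** Hypothetical in-memory stand-in for the proposed detach mode; not Flink code. */
class DetachedSavepointSketch {

    enum Status { PENDING, COMPLETED, FAILED, UNKNOWN }

    // Triggered savepoints, keyed by the client-generated id.
    private final Map<String, CompletableFuture<String>> savepoints = new ConcurrentHashMap<>();

    /** Generates the id on the client side, starts the work asynchronously, returns at once. */
    String triggerDetached(Supplier<String> savepointWork) {
        String id = UUID.randomUUID().toString();
        savepoints.put(id, CompletableFuture.supplyAsync(savepointWork));
        return id;
    }

    /** Stand-in for the status query: reports whether a triggered savepoint has finished. */
    Status status(String id) {
        CompletableFuture<String> result = savepoints.get(id);
        if (result == null) {
            return Status.UNKNOWN;
        }
        if (!result.isDone()) {
            return Status.PENDING;
        }
        return result.isCompletedExceptionally() ? Status.FAILED : Status.COMPLETED;
    }

    public static void main(String[] args) {
        DetachedSavepointSketch client = new DetachedSavepointSketch();
        CountDownLatch snapshotFinished = new CountDownLatch(1);
        String id = client.triggerDetached(() -> {
            try {
                snapshotFinished.await(); // simulates a long-running snapshot
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "savepoint-path-placeholder";
        });
        // The client already has the id although the snapshot has not finished.
        System.out.println(id + " -> " + client.status(id));
        snapshotFinished.countDown();
        while (client.status(id) == Status.PENDING) {
            Thread.yield(); // a real client would poll at its own pace or exit
        }
        System.out.println(id + " -> " + client.status(id));
    }
}
```

The point of the sketch is that `triggerDetached` never waits for the 
snapshot, so a client managing many jobs does not keep one waiting thread per 
savepoint and no client-side timeout applies to the snapshot itself.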





[jira] [Created] (FLINK-31249) Checkpoint Timer failed to process timeout events when it blocked at writing _metadata to DFS

2023-02-27 Thread renxiang zhou (Jira)
renxiang zhou created FLINK-31249:
-

 Summary: Checkpoint Timer failed to process timeout events when it 
blocked at writing _metadata to DFS
 Key: FLINK-31249
 URL: https://issues.apache.org/jira/browse/FLINK-31249
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.16.0, 1.11.6
Reporter: renxiang zhou
 Fix For: 1.18.0
 Attachments: image-2023-02-28-11-25-03-637.png

The jobmanager-future thread may be blocked while writing metadata to DFS 
because of a DFS failure, and this thread holds the CheckpointCoordinator lock. 

When the next checkpoint is triggered, the Checkpoint Timer thread waits for 
the lock to be released. If the previous checkpoint times out, the Checkpoint 
Timer does not execute the timeout event because it is blocked waiting for the 
lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
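
The blocking pattern described above can be reproduced with a small 
stand-alone Java model. It is a sketch, not Flink code: `BlockedTimerSketch` 
and its variables are hypothetical, the hanging DFS write is simulated with a 
latch, and the timer is assumed to be a single-threaded scheduled executor, 
which is what the description implies.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Stand-alone model of the reported blocking; thread names only mirror the report. */
class BlockedTimerSketch {

    /** Returns {timeout handled while the lock was held, timeout handled after release}. */
    static boolean[] run(long observeMillis) {
        Object coordinatorLock = new Object();
        CountDownLatch lockTaken = new CountDownLatch(1);
        CountDownLatch dfsRecovered = new CountDownLatch(1);
        CountDownLatch timeoutHandled = new CountDownLatch(1);

        // Models the jobmanager-future thread: holds the lock while the write hangs.
        Thread writer = new Thread(() -> {
            synchronized (coordinatorLock) {
                lockTaken.countDown();
                await(dfsRecovered, 60_000); // simulated hanging write of _metadata
            }
        }, "jobmanager-future");
        writer.setDaemon(true);
        writer.start();
        await(lockTaken, 5_000);

        // Assumption: the timer runs all of its tasks on one thread.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "Checkpoint Timer");
            t.setDaemon(true);
            return t;
        });
        // Triggering the next checkpoint needs the lock, so the timer thread blocks here.
        timer.execute(() -> { synchronized (coordinatorLock) { /* trigger next checkpoint */ } });
        // The timeout event of the previous checkpoint is queued behind the blocked task.
        timer.schedule(() -> timeoutHandled.countDown(), 10, TimeUnit.MILLISECONDS);

        boolean whileBlocked = await(timeoutHandled, observeMillis);
        dfsRecovered.countDown(); // the write returns and the lock is released
        boolean afterRelease = await(timeoutHandled, 5_000);
        timer.shutdownNow();
        return new boolean[] {whileBlocked, afterRelease};
    }

    private static boolean await(CountDownLatch latch, long millis) {
        try {
            return latch.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run(200);
        System.out.println("timeout handled while lock was held: " + result[0]);
        System.out.println("timeout handled after lock release:  " + result[1]);
    }
}
```

In this model the timeout task cannot run until the writer releases the lock, 
which corresponds to the previous checkpoint not being cancelled on time.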


