Rui Fan created FLINK-33092:
-------------------------------

             Summary: Improve the resource-stabilization-timeout mechanism when rescale a job for Adaptive Scheduler
                 Key: FLINK-33092
                 URL: https://issues.apache.org/jira/browse/FLINK-33092
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: Rui Fan
            Assignee: Rui Fan
         Attachments: image-2023-09-15-14-43-35-104.png

!image-2023-09-15-14-43-35-104.png|width=776,height=548!

h1. 1. Proposal

The figure above is the state transition graph for rescaling a job in the Adaptive Scheduler.

In brief, when a rescale is triggered, the job waits for _*resource-stabilization-timeout*_ in the WaitingForResources state if it has sufficient resources but not the desired resources.

If the _*resource-stabilization-timeout*_ mechanism is moved into the Executing state, the rescale downtime will be significantly reduced.

h1. 2. Why is the downtime long?

Currently, when a job is rescaled:
 * Executing transitions to Restarting.
 * Restarting cancels the job first.
 * Restarting transitions to WaitingForResources after the whole job has reached a terminal state.
 * If the job has sufficient resources but not the desired resources, WaitingForResources waits for _*resource-stabilization-timeout*_.
 * WaitingForResources transitions to CreatingExecutionGraph after resource-stabilization-timeout.

The problem is that the job isn't running during the resource-stabilization-timeout phase.

h1. 3. How to reduce the downtime?

We can move the _*resource-stabilization-timeout*_ mechanism into the Executing state when a rescale is triggered. This means:
 * If the job has the desired resources, Executing can rescale directly.
 * If the job has sufficient resources but not the desired resources, Executing rescales after _*resource-stabilization-timeout*_.
 * During a rescale, WaitingForResources no longer waits for resource-stabilization-timeout.

Because resource-stabilization-timeout then elapses before the job is cancelled, the job keeps running while it waits, and the rescale downtime will be significantly reduced.

Note: resource-stabilization-timeout still applies in WaitingForResources when a job is started. The behavior changes only when a job is rescaled.
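
To make the proposed behavior of the Executing state concrete, here is a minimal sketch of the rescale decision in plain Java. It is an illustration only and is not taken from the Flink code base: the class {{RescaleDecisionSketch}}, the enum {{Action}} and the method {{decide}} are hypothetical names. The issue does not describe what happens when the job lacks even sufficient resources, so the last branch (keep running) is an assumption.

```java
import java.time.Duration;

/** Hypothetical sketch of the proposed rescale decision; not Flink's actual API. */
public class RescaleDecisionSketch {

    /** What the Executing state would do after a rescale has been triggered. */
    public enum Action {
        RESCALE_NOW,            // cancel and restart the job with the new parallelism
        KEEP_RUNNING_AND_WAIT,  // job stays running until the timeout elapses
        KEEP_RUNNING            // assumption: nothing to rescale to
    }

    public static Action decide(
            boolean hasDesiredResources,
            boolean hasSufficientResources,
            Duration sinceRescaleTriggered,
            Duration resourceStabilizationTimeout) {
        // Desired resources are available: rescale directly, no waiting.
        if (hasDesiredResources) {
            return Action.RESCALE_NOW;
        }
        // Sufficient but not desired: wait for the stabilization timeout
        // while the job is still running, then rescale.
        if (hasSufficientResources) {
            boolean timeoutElapsed =
                    sinceRescaleTriggered.compareTo(resourceStabilizationTimeout) >= 0;
            return timeoutElapsed ? Action.RESCALE_NOW : Action.KEEP_RUNNING_AND_WAIT;
        }
        // Assumption (not stated in the issue): without sufficient resources
        // the job simply keeps running with its current parallelism.
        return Action.KEEP_RUNNING;
    }
}
```

The point of the sketch is that the waiting branch returns while the job is still executing, so the stabilization timeout no longer adds to the downtime. Only the RESCALE_NOW branch would lead to cancelling the job.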