Hi Salva,

I'm currently out of bandwidth. Looking forward to your proposal!

Best,
Zhanghao Chen
________________________________
From: Salva Alcántara <salcantara...@gmail.com>
Sent: Saturday, August 16, 2025 3:23
To: user <user@flink.apache.org>
Subject: Re: Autoscaling Global Scaling Factor (???)

Yeah, I was thinking of enforcing that restriction myself by taking the max too.
Anyway, since the change is simple enough, I think it makes sense to offer that
(global scale factor) option, especially for coarse-grained resource management.
We could create a ticket for that; what do you think, Chen? Maybe you could just
push your changes there? Otherwise, I could send a proposal myself.

BTW, a friend of mine (señor!) bumped Flink to 1.20 and reported better task
distribution. He will post an update on this soon...

Regards,

Salva



On Fri, Aug 15, 2025 at 1:04 PM Zhanghao Chen 
<zhanghao.c...@outlook.com> wrote:
Not quite possible based on the current version. We run an internal version of
the Autoscaler in our production env. One major difference is that we let the
whole pipeline (except the sources/sinks) have the same parallelism to avoid
uneven task distribution. The change is relatively simple: just run the
algorithm per vertex and take the max of the results.
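
Roughly, the idea in a simplified Java sketch (made-up names only, not our
actual internal patch):

```
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the post-processing idea: run the usual per-vertex parallelism
// computation first, then force every vertex that is not a source/sink to the
// max of the computed values, so the whole pipeline scales uniformly.
class UniformParallelismPostProcessor {

    static Map<String, Integer> apply(Map<String, Integer> perVertexParallelism,
                                      Set<String> sourceAndSinkVertices) {
        // Global parallelism = max over all non-source/sink vertices.
        int uniform = perVertexParallelism.entrySet().stream()
                .filter(e -> !sourceAndSinkVertices.contains(e.getKey()))
                .mapToInt(Map.Entry::getValue)
                .max()
                .orElse(1);

        // Keep sources/sinks as computed; override everything else.
        Map<String, Integer> result = new HashMap<>(perVertexParallelism);
        result.replaceAll((vertex, p) ->
                sourceAndSinkVertices.contains(vertex) ? p : uniform);
        return result;
    }
}
```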

Best,
Zhanghao Chen
________________________________
From: Salva Alcántara <salcantara...@gmail.com>
Sent: Thursday, August 14, 2025 12:24
To: user <user@flink.apache.org>
Subject: Re: Autoscaling Global Scaling Factor (???)

That was on my agenda already. Will try and let you know how it goes.

Regarding my questions, do you think it's possible to achieve any of those
points, so that the autoscaler behaves the same as when you simply add/remove
replicas by hand?

Thanks Chen!

Salva

On Thu, Aug 14, 2025 at 2:58 AM Zhanghao Chen 
<zhanghao.c...@outlook.com> wrote:
Hi, you may want to upgrade Flink to 1.19.3, 1.20.2, or 2.0.1+. There's a known
issue where the Autoscaler may not minimize the number of TMs during downscaling
with the adaptive scheduler [1].

[1] https://issues.apache.org/jira/browse/FLINK-33977

Best,
Zhanghao Chen

________________________________
From: Salva Alcántara <salcantara...@gmail.com>
Sent: Wednesday, August 13, 2025 20:56
To: user <user@flink.apache.org>
Subject: RE: Autoscaling Global Scaling Factor (???)

BTW, I'm running Flink 1.18.1 on top of operator 1.12.1 with the following
autoscaler settings:

```
      job.autoscaler.enabled: "true"
      job.autoscaler.scaling.enabled: "true"
      job.autoscaler.scale-down.enabled: "true"
      job.autoscaler.vertex.max-parallelism: "8"
      job.autoscaler.vertex.min-parallelism: "1"
      jobmanager.scheduler: adaptive
      job.autoscaler.metrics.window: 15m
      job.autoscaler.metrics.busy-time.aggregator: MAX
      job.autoscaler.backlog-processing.lag-threshold: 2m
      job.autoscaler.scaling.effectiveness.detection.enabled: "true"
      job.autoscaler.scaling.effectiveness.threshold: "0.3"
      job.autoscaler.scaling.event.interval: 10m
      job.autoscaler.stabilization.interval: 5m
      job.autoscaler.scale-up.max-factor: "100000.0"
      job.autoscaler.scaling.key-group.partitions.adjust.mode: "EVENLY_SPREAD"
      job.autoscaler.scale-down.interval: 30m
      job.autoscaler.scale-down.max-factor: "0.5"
      job.autoscaler.memory.tuning.scale-down-compensation.enabled: "true"
      job.autoscaler.catch-up.duration: 5m
      job.autoscaler.restart.time: 15m
      job.autoscaler.restart.time-tracking.enabled: "true"
      job.autoscaler.utilization.target: "0.8"
```
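
For reference, my understanding of how the scale-down.max-factor above (0.5)
bounds a single scaling step, based on my reading of the docs rather than the
actual operator code:

```
// Rough illustration only (my interpretation of the docs, not the operator's
// actual implementation): with job.autoscaler.scale-down.max-factor = 0.5, a
// single scaling decision should not shrink a vertex below 50% of its current
// parallelism. The helper name is made up.
static int clampScaleDown(int currentParallelism, int proposedParallelism,
                          double scaleDownMaxFactor) {
    int floor = (int) Math.ceil(currentParallelism * (1.0 - scaleDownMaxFactor));
    return Math.max(proposedParallelism, Math.max(floor, 1));
}
```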

Regards,

Salva
