> This raises a couple of questions:
> - Is it a bug though, that the state restoring goes wrong like it does for my
> job? Based on my experience it seems like rescaling sometimes works, but then
> you can have these random errors.
If there is a problem, I would still consider it a bug, because it should work.
> - If it's not supported properly, why not refuse to restore a checkpoint if
> it would require rescaling?
It should work properly, but I would prefer to keep this at the level of a
"hidden feature" until it gets some more exposure and until some open questions
about the future differences between savepoints and checkpoints are resolved.
> - We have sometimes had Flink jobs where the state has become so heavy that
> cancelling with a savepoint times out & fails. Incremental checkpoints are
> still working because they don't time out as long as the state is growing
> linearly. In that case if we want to scale up (for example to enable
> successful savepoint creation ;) ), the only thing we can do is to restore
> from the latest checkpoint. But then we have no way to scale up by increasing
> the cluster size: we can't create a savepoint with the smaller cluster, but on
> the other hand we can't restore a checkpoint to a bigger cluster if rescaling
> from a checkpoint is not supposed to be relied on. So in this case
> we're stuck and forced to start from an empty state?
IMO there is a very good chance that this will simply become a normal feature
in the near future.
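
For reference, the mechanics of trying this today are the same as for a
savepoint restore: with externalized (retained) checkpoints enabled, the last
completed checkpoint can be passed to a resubmitted job that runs with a higher
parallelism. Below is a minimal sketch of the job-side configuration, assuming
the RocksDB backend with incremental checkpoints as in the scenario above
(paths, intervals, and names are placeholders, not a definitive setup):

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled
        // (checkpoint URI is a placeholder).
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Checkpoint every minute and retain the last completed checkpoint on
        // cancellation, so it can be used to restart the job with more parallelism.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline; the real job graph goes here.
        env.fromElements(1, 2, 3).print();

        env.execute("retained-checkpoint-sketch");
    }
}
```

The restore itself is then triggered the same way as a savepoint restore, e.g.
`flink run -s hdfs:///flink/checkpoints/<job-id>/chk-<n> -p <higher-parallelism> my-job.jar`
(again just a sketch; the exact checkpoint path layout depends on the state
backend and Flink version).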