> If you say that you can reproduce the problem, does that mean reproduce from the single existing checkpoint or also creating other problematic checkpoints?
Yes. I haven't tried creating another checkpoint and rescaling from it. I can try that.

> We are including rescaling in some end-to-end tests now, and then let's see what happens.

If I understood correctly, there is some difference in how timers & other state are written. It might be interesting if you would include a test with state that holds both timers and keyed MapState, like the code snippet in my original message. I believe this is a common usage pattern anyway. The test should verify the transactionality of restoring both timers & MapState consistently.
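For illustration, here is a minimal sketch of that kind of pattern (an assumed example, not the exact snippet from my original message): a keyed function that writes to MapState and registers an event-time timer for the same key. The class, state, and field names are made up.

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class TimersAndMapStateFunction
        extends ProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    // Keyed MapState: one map per key of the upstream keyBy().
    private transient MapState<String, Long> values;

    @Override
    public void open(Configuration parameters) {
        values = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("values", String.class, Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> in, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        // Write to keyed MapState ...
        values.put(in.f0, in.f1);
        // ... and register an event-time timer for the same key.
        ctx.timerService().registerEventTimeTimer(in.f1 + 60_000L);
        out.collect(in);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // After a restore (with or without rescaling) this timer must fire on the
        // subtask that also holds this key's restored MapState entries.
        values.clear();
    }
}

The function would be applied after a keyBy(), e.g. stream.keyBy(t -> t.f0).process(new TimersAndMapStateFunction()), so that both the MapState and the timers are scoped to the same keys that get redistributed on rescaling.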
On Fri, May 18, 2018 at 10:51 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

> Hi,
>
> I had a look at the logs from the restoring job and couldn't find anything suspicious in them. Everything looks as expected and the state files are properly found and transferred from S3. We are including rescaling in some end-to-end tests now, and then let's see what happens.
> If you say that you can reproduce the problem, does that mean reproduce from the single existing checkpoint or also creating other problematic checkpoints? I am asking because a log from the job that produces the problematic checkpoint might be more helpful. You can create a ticket if you want.
>
> Best,
> Stefan
>
> On 18.05.2018, at 09:02, Juho Autio <juho.au...@rovio.com> wrote:
>
> I see. I appreciate keeping this option available even if it's "beta". The current situation could be documented better, though.
>
> As long as rescaling from a checkpoint is not officially supported, I would put it behind a flag similar to --allowNonRestoredState. The flag could be called --allowRescalingRestoredCheckpointState, for example. This would make sure that users are aware that what they're using is experimental and might have unexpected effects.
>
> As for the bug I faced, indeed I was able to reproduce it consistently, and I have provided TRACE-level logs personally to Stefan. If there is no Jira ticket for this yet, would you like me to create one?
>
> On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> > This raises a couple of questions:
>> > - Is it a bug though, that the state restoring goes wrong like it does for my job? Based on my experience it seems like rescaling sometimes works, but then you can have these random errors.
>>
>> If there is a problem, I would still consider it a bug because it should work correctly.
>>
>> > - If it's not supported properly, why not refuse to restore a checkpoint if it would require rescaling?
>>
>> It should work properly, but I would prefer to keep this at the level of a "hidden feature" until it has gotten some more exposure and some questions about the future of the differences between savepoints and checkpoints are resolved.
>>
>> > - We have sometimes had Flink jobs where the state has become so heavy that cancelling with a savepoint times out & fails. Incremental checkpoints are still working because they don't time out as long as the state is growing linearly. In that case, if we want to scale up (for example to enable successful savepoint creation ;) ), the only thing we can do is to restore from the latest checkpoint. But then we have no way to scale up by increasing the cluster size, because we can't create a savepoint with a smaller cluster, but on the other hand we can't restore a checkpoint to a bigger cluster if rescaling from a checkpoint is not supposed to be relied on. So in this case we're stuck and forced to start from an empty state?
>>
>> IMO there is a very good chance that this will simply become a normal feature in the near future.
>>
>> Best,
>> Stefan
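For reference, here is a minimal sketch of the kind of setup discussed above: retained (externalized), incremental checkpoints that can later be restored, and possibly rescaled, via the CLI. The state backend choice, the S3 path, and the interval are illustrative assumptions, not details taken from the thread.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedIncrementalCheckpoints {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental RocksDB checkpoints: only the changes since the previous
        // checkpoint are uploaded, so checkpoints keep completing even when the
        // total state has grown too large for a savepoint to finish in time.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

        env.enableCheckpointing(60_000L);

        // Retain the last completed checkpoint when the job is cancelled, so it can
        // later be restored with the CLI, e.g.:
        //   bin/flink run -s <checkpointPath> -p <newParallelism> ...
        // -n / --allowNonRestoredState is the existing flag mentioned above;
        // --allowRescalingRestoredCheckpointState is only a proposal in this thread.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the job topology and call env.execute() here.
    }
}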