> If you say that you can reproduce the problem, does that mean reproduce
from the single existing checkpoint

Yes.

I haven't tried creating another checkpoint and rescaling from it. I can
try that.

> We are including rescaling in some end-to-end tests now, and then let's
> see what happens.

If I understood correctly, there is some difference in how timers & other
state are written. It would be good to include a test with state that holds
both timers and keyed MapState, like the code snippet in my original
message. I believe this is a common usage pattern anyway. The test should
verify that both timers & MapState are restored consistently
(transactionally).
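
Roughly this kind of function is what I have in mind (just a minimal
sketch, not the exact snippet from my original message; the class, state
and window names are placeholders): a KeyedProcessFunction that registers
event-time timers and keeps per-key MapState, so a rescaled restore has to
bring both back consistently for every key:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: (key, event timestamp). Output: "key,windowStart,count" per closed window.
public class TimerAndMapStateFunction
        extends KeyedProcessFunction<String, Tuple2<String, Long>, String> {

    private static final long WINDOW_MS = 60_000L;

    private transient MapState<Long, Long> counts;

    @Override
    public void open(Configuration parameters) {
        counts = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("counts", Long.class, Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx,
                               Collector<String> out) throws Exception {
        long window = event.f1 - (event.f1 % WINDOW_MS);
        Long current = counts.get(window);
        counts.put(window, current == null ? 1L : current + 1L);
        // The timer and the MapState entry are written together; after restoring
        // with a different parallelism they must still line up for the same key.
        ctx.timerService().registerEventTimeTimer(window + WINDOW_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        long window = timestamp - WINDOW_MS;
        Long count = counts.get(window);
        // If the timer fires but the MapState entry is missing (or vice versa),
        // the restore was not consistent.
        out.collect(ctx.getCurrentKey() + "," + window + ","
                + (count == null ? 0L : count));
        counts.remove(window);
    }
}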

On Fri, May 18, 2018 at 10:51 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

> Hi,
>
> I had a look at the logs from the restoring job and couldn’t find anything
> suspicious in them. Everything looks as expected and the state files are
> properly found and transferred from S3. We are including rescaling in some
> end-to-end tests now, and then let's see what happens.
> If you say that you can reproduce the problem, does that mean reproduce
> from the single existing checkpoint, or can you also create other
> problematic checkpoints? I am asking because a log from the job that
> produces the problematic checkpoint might be more helpful. You can create a
> ticket if you want.
>
> Best,
> Stefan
>
>
> On 18.05.2018, at 09:02, Juho Autio <juho.au...@rovio.com> wrote:
>
> I see. I appreciate keeping this option available even if it's "beta". The
> current situation could be documented better, though.
>
> As long as rescaling from checkpoint is not officially supported, I would
> put it behind a flag similar to --allowNonRestoredState. The flag could be
> called --allowRescalingRestoredCheckpointState, for example. This would
> make sure that users are aware that what they're using is experimental and
> might have unexpected effects.
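>
> For illustration, with such a flag a restore with higher parallelism might
> look something like this (the checkpoint path here is made up, and
> --allowRescalingRestoredCheckpointState is only the name proposed above,
> not an existing option):
>
>     flink run -s s3://bucket/checkpoints/<job-id>/chk-1234 -p 8 \
>       --allowRescalingRestoredCheckpointState my-job.jar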
>
> As for the bug I faced, indeed I was able to reproduce it consistently.
> And I have provided TRACE-level logs personally to Stefan. If there is no
> Jira ticket for this yet, would you like me to create one?
>
> On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> >
>> > This raises a couple of questions:
>> > - Is it a bug, though, that restoring the state goes wrong like it does
>> > for my job? Based on my experience it seems like rescaling sometimes
>> > works, but then you can get these random errors.
>>
>> If there is a problem, I would still consider it a bug because it should
>> work correctly.
>>
>> > - If it's not supported properly, why not refuse to restore a
>> > checkpoint if it would require rescaling?
>>
>> It should work properly, but I would prefer to keep this at the level of
>> a "hidden feature" until it has gotten some more exposure and until some
>> open questions about the future differences between savepoints and
>> checkpoints are resolved.
>>
>> > - We have sometimes had Flink jobs where the state has become so heavy
>> > that cancelling with a savepoint times out & fails. Incremental
>> > checkpoints still work because they don't time out as long as the state
>> > is growing linearly. In that case, if we want to scale up (for example
>> > to enable successful savepoint creation ;) ), the only thing we can do
>> > is restore from the latest checkpoint. But then we have no way to scale
>> > up by increasing the cluster size: we can't create a savepoint with the
>> > smaller cluster, and on the other hand we can't restore a checkpoint to
>> > a bigger cluster if rescaling from a checkpoint is not supposed to be
>> > relied on. So in this case we're stuck and forced to start from an
>> > empty state?
>>
>> IMO there is a very good chance that this will simply become a normal
>> feature in the near future.
>>
>> Best,
>> Stefan
>>
>>
>
>
