Hi Juho,
would you like to try out the latest RC 
(http://people.apache.org/~trohrmann/flink-1.5.0-rc4/) to rescale the job 
from the "problematic" checkpoint? The latest RC includes a fix for a 
potential silent data loss. If that is the cause, you will see a different 
exception when you try to recover your job.
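
For example, the restore attempt could look roughly like this (the checkpoint 
path, parallelism and jar name below are only placeholders for your 
environment):

    # from the RC distribution's bin/ directory: resume the job from the
    # checkpoint's metadata with a changed parallelism to trigger rescaling
    bin/flink run \
        -s hdfs:///flink/checkpoints/<job-id>/chk-<n>/_metadata \
        -p 16 \
        your-job.jar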


Best, Sihua

On 05/18/2018 15:02, Juho Autio <juho.au...@rovio.com> wrote:
I see. I appreciate keeping this option available even if it's "beta". The 
current situation could be documented better, though.


As long as rescaling from a checkpoint is not officially supported, I would put 
it behind a flag similar to --allowNonRestoredState. The flag could be called 
--allowRescalingRestoredCheckpointState, for example. This would make sure that 
users are aware that what they're using is experimental and might have 
unexpected effects.
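
To illustrate with the CLI (the second flag below is only my proposal and does 
not exist today; paths are placeholders):

    # existing flag: allow dropping restored state that cannot be mapped to the new job
    bin/flink run -s <savepointPath> --allowNonRestoredState my-job.jar

    # proposed: require an explicit opt-in when the restored checkpoint is rescaled
    # to a different parallelism (hypothetical flag, not implemented)
    bin/flink run -s <checkpointMetaDataPath> -p 32 \
        --allowRescalingRestoredCheckpointState my-job.jar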


As for the bug I faced, indeed I was able to reproduce it consistently. And I 
have provided TRACE-level logs personally to Stefan. If there is no Jira ticket 
for this yet, would you like me to create one?


On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <s.rich...@data-artisans.com> 
wrote:
Hi,

>
> This raises a couple of questions:
> - Is it a bug, though, that the state restore goes wrong like it does for my 
> job? In my experience, rescaling sometimes works, but then you can get these 
> random errors.

If there is a problem, I would still consider it a bug because it should work 
correctly.

> - If it's not supported properly, why not refuse to restore a checkpoint if 
> it would require rescaling?

It should work properly, but I would prefer to keep this at the level of a 
"hidden feature" until it gets some more exposure and some open questions about 
the future differences between savepoints and checkpoints are resolved.

> - We have sometimes had Flink jobs where the state has become so heavy that 
> cancelling with a savepoint times out and fails. Incremental checkpoints still 
> work because they don't time out as long as the state grows linearly. In that 
> case, if we want to scale up (for example to enable successful savepoint 
> creation ;) ), the only thing we can do is restore from the latest checkpoint. 
> But then we have no way to scale up by increasing the cluster size: we can't 
> create a savepoint with the smaller cluster, and on the other hand we can't 
> restore a checkpoint to a bigger cluster if rescaling from a checkpoint is not 
> supposed to be relied on. So in this case are we stuck and forced to start 
> from an empty state?

IMO there is a very good chance that this will simply become a normal feature 
in the near future.
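
For reference, the sequence in the scenario you describe would look roughly 
like this today (job id, directories and parallelism are placeholders):

    # cancel with a savepoint, which can time out on very large state
    bin/flink cancel -s hdfs:///flink/savepoints <job-id>

    # fallback: resume from the latest retained (incremental) checkpoint with a
    # higher parallelism, i.e. the rescaling-from-checkpoint case discussed here
    bin/flink run -s hdfs:///flink/checkpoints/<job-id>/chk-<n> -p 32 my-job.jar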

Best,
Stefan


