[ 
https://issues.apache.org/jira/browse/FLINK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456700#comment-17456700
 ] 

Seth Wiesman edited comment on FLINK-25236 at 12/9/21, 9:31 PM:
----------------------------------------------------------------

Unfortunately, this is not so simple to do.

The method you linked to only validates that the max parallelism has not 
changed and that each operator has a mapping within the checkpoint. State 
changes are validated lazily, when the state descriptor is registered within 
the runtime context, because state descriptors themselves are only registered 
lazily. This means the only way to fully validate that a DataStream 
application can restore from a snapshot is to attempt the restore.
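To illustrate the lazy registration, here is a toy model (plain Java, not Flink's actual classes or API): a "restored backend" that can only check a descriptor against the snapshot at the moment user code registers that descriptor, mirroring why the mismatch cannot surface at job-graph construction or submission time.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model -- NOT Flink's real classes -- of why state compatibility can only
// be checked lazily: a restored snapshot is matched against a state descriptor
// at the moment user code registers that descriptor (e.g. in open()), not when
// the job graph is built or submitted.
public class LazyStateValidation {

    static final class StateDescriptor {
        final String name;
        final Class<?> type;
        StateDescriptor(String name, Class<?> type) {
            this.name = name;
            this.type = type;
        }
    }

    static final class RestoredBackend {
        private final Map<String, Class<?>> restored = new HashMap<>();

        void putRestored(String name, Class<?> type) {
            restored.put(name, type);
        }

        // Compatibility is checked here, at registration time inside the
        // running task -- the earliest point where both sides are known.
        Object getState(StateDescriptor desc) {
            Class<?> existing = restored.get(desc.name);
            if (existing != null && !existing.equals(desc.type)) {
                throw new IllegalStateException(
                    "state '" + desc.name + "': snapshot holds " + existing
                        + " but descriptor asks for " + desc.type);
            }
            return new Object(); // stand-in for a real state handle
        }
    }

    public static void main(String[] args) {
        RestoredBackend backend = new RestoredBackend();
        backend.putRestored("counter", Long.class);

        // Building/submitting the job says nothing about state types; the
        // mismatch only surfaces once the descriptor is registered at runtime.
        try {
            backend.getState(new StateDescriptor("counter", String.class));
        } catch (IllegalStateException e) {
            System.out.println("detected only at runtime: " + e.getMessage());
        }
    }
}
```

In real Flink the same structure holds: descriptors are typically registered in a RichFunction's open() on the task managers, so a full up-front check would have to simulate the restore anyway.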

 

I would typically recommend either QA testing or blue/green deployments for 
these kinds of low-latency requirements. Both are readily achieved with 
Flink's snapshot-based fault-tolerance model. Take a savepoint of your 
production workload and use it to start the new application in a QA 
environment. This new application can read from production sources and hold 
internal production state, and, so long as sinks are configured dynamically, 
it can write to a non-production output. If this restore works, then 
deploying to production is guaranteed to succeed. This also gives you the 
opportunity to validate the output of your changes before deploying them to 
production. 
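A minimal sketch of "sinks configured dynamically" (the topic names and the DEPLOY_ENV variable are made up for illustration): the same application jar, restored from a production savepoint, resolves its output target from the environment so a QA deployment can never write to the production output.

```java
// Hedged sketch (topic names and the DEPLOY_ENV variable are assumptions):
// resolve the sink target from the environment so the same application jar,
// restored from a production savepoint, writes to a non-production output
// when it runs in QA.
public class SinkConfig {

    static String resolveSinkTopic(String env, String baseTopic) {
        // "prod" writes to the real topic; any other environment gets a
        // suffixed topic so QA output never mixes with production output.
        return "prod".equals(env) ? baseTopic : baseTopic + "." + env;
    }

    public static void main(String[] args) {
        String env = System.getenv().getOrDefault("DEPLOY_ENV", "qa");
        System.out.println("writing to: " + resolveSinkTopic(env, "orders-out"));
    }
}
```

In a real job the resolved name would be passed to whatever sink connector the application uses when the pipeline is built.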

 

 



> Add a mechanism to generate and validate a jobgraph with a checkpoint before 
> submission
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-25236
>                 URL: https://issues.apache.org/jira/browse/FLINK-25236
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Ben Augarten
>            Priority: Major
>
> I've mostly worked on Flink 1.9-1.12, but I believe this is still an issue 
> today. 
>  
> I've worked on a few Flink applications now that have struggled to reliably 
> activate new versions of a currently running job. Sometimes users make 
> changes to a job graph such that state cannot be restored. Sometimes users 
> make changes to a job graph that make it unschedulable on a given cluster 
> (increased parallelism with insufficient task slots on the cluster). These 
> validations are [performed 
> here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L120]
>  
> It's not Flink's problem that these issues arise, but they are only 
> detected when the JM tries to run the given job graph. For exactly-once 
> applications (and other applications where running two job graphs for the 
> same application is undesirable), there is unneeded downtime when users 
> submit job graphs with breaking changes: users must cancel the old job, 
> submit the new job to see whether it is valid and will activate, and then 
> resubmit the old job if activation fails. For a user with low-latency 
> requirements, this change-management process is unfortunate, and there 
> doesn't seem to be anything technical preventing these validations from 
> happening earlier.
>  
> Suggestion: provide a mechanism for users to (1) create and (2) validate the 
> new job graph+checkpoint without running it so that they do not need to 
> cancel a currently running version of the job until they're more sure it will 
> activate



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
