Vijay created FLINK-34009:
-----------------------------

             Summary: Apache flink: Checkpoint restoration issue on Application 
Mode of deployment
                 Key: FLINK-34009
                 URL: https://issues.apache.org/jira/browse/FLINK-34009
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.18.0
         Environment: Flink version: 1.18

ZooKeeper version: 3.7.2

Env: Custom Flink Docker image (with embedded application class) deployed on 
Kubernetes (v1.26.11).
            Reporter: Vijay


Hi Team,

Good Day. Wish you all a happy new year 2024.

We are running Flink 1.18 on our cluster. The Job Manager is deployed in 
"Application Mode" with HA disabled (high-availability.type: NONE). With this 
configuration we are able to start multiple jobs of a single application 
(using env.executeAsync()).

Note: We have also set up checkpointing to an S3 instance with 
RETAIN_ON_CANCELLATION mode (plus other required settings).
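
For illustration, a minimal sketch of the configuration described above, 
assuming the settings live in flink-conf.yaml. The bucket path and the 
checkpoint interval are placeholders and are not taken from the actual 
deployment; the "other required settings" are omitted.

```yaml
# Sketch only: the S3 path and the interval are hypothetical placeholders.
high-availability.type: NONE
state.checkpoints.dir: s3://<bucket>/flink/checkpoints
execution.checkpointing.interval: 60s
# Keep completed checkpoints when a job is cancelled.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```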

Let's say we start two jobs of the same application (e.g. Jobidxxx1 and 
Jobidxxx2) and they are running in the k8s environment. When we need to 
perform a Flink minor upgrade, or upgrade our application with minor changes, 
we stop the Job Manager and Task Manager instances, perform the upgrade, and 
then start the Job Manager and Task Manager instances again. On startup we 
expect the jobs to be restored from the last checkpoint, but no job 
restoration happens on Job Manager startup. Please let us know whether this 
is a bug or the expected behavior of Flink in Application Mode.

Additional information: If we enable HA (using ZooKeeper) in Application Mode, 
we are able to start only one job (i.e. per-job behavior). In that setup, when 
we perform a Flink minor upgrade, or upgrade our application with minor 
changes, checkpoint restoration works properly after the Job Manager and Task 
Managers are restarted.
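
For comparison, a sketch of what the HA-enabled variant might look like in 
flink-conf.yaml. The quorum hosts, storage path, and cluster id below are 
hypothetical placeholders, not values from the actual deployment.

```yaml
# Sketch only: hosts, storage path, and cluster id are hypothetical.
high-availability.type: zookeeper
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
# Location where the HA metadata is persisted.
high-availability.storageDir: s3://<bucket>/flink/ha
high-availability.cluster-id: /my-application
```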

It seems that checkpoint restoration and HA are interrelated, but why does 
checkpoint restoration not work when HA is disabled?

 

Please let us know if anyone has experienced similar issues or has any 
suggestions; it would be highly appreciated. Thanks in advance for your 
assistance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
