Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

Gyula Fóra Wed, 19 Jul 2023 13:33:33 -0700

Hi!

I don’t understand why you need to delete the deployment to restart. You
can suspend, use the restartNonce, or simply upgrade.


These should cover most upgrade/restart scenarios. As with other
resources in Kubernetes, once you delete them the status is gone, so the
FlinkDeployment won’t keep the last state info.
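
To make the effect of deletion concrete, here is a toy Python model of the
behaviour described in this thread. It is only a sketch and not the
operator's code: the function name and the last_known_state argument are
hypothetical stand-ins for what the resource status remembers, and only
initialSavepointPath and upgradeMode correspond to real spec fields.

```python
from typing import Optional


def pick_start_state(
    upgrade_mode: str,
    initial_savepoint_path: Optional[str],
    last_known_state: Optional[str],
) -> Optional[str]:
    """Toy model: return the path a job restores from, or None for empty state.

    last_known_state is a hypothetical stand-in for the status kept on the
    resource. It is None for a freshly created (or deleted and recreated)
    FlinkDeployment, because deleting the resource discards its status.
    """
    if last_known_state is None:
        # First deployment: the only hint available is initialSavepointPath.
        return initial_savepoint_path
    if upgrade_mode == "stateless":
        # Stateless upgrades deliberately start from empty state.
        return None
    # last-state / savepoint upgrades continue from the latest known state.
    return last_known_state


# Upgrade in place: the status still remembers savepoint-20.
in_place = pick_start_state(
    "savepoint", "s3://savepoints/savepoint-10", "s3://savepoints/savepoint-20"
)
# Delete and recreate: the status is gone, so only initialSavepointPath is left.
recreated = pick_start_state("savepoint", "s3://savepoints/savepoint-10", None)
```

In this model the in-place upgrade resumes from savepoint-20, while the
delete-and-recreate case falls back to savepoint-10.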

To keep the state after deletion you would have to introduce new resources
or an external state store. We are not planning to support this as it goes
against the standard Kubernetes resource management flow.

I think you should look into simply suspending the job for a while, or
just use a regular upgrade to fit your needs.
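
As a rough sketch of what suspending or restarting without deletion could
look like, the Python below builds JSON merge-patch bodies for a
FlinkDeployment. The helper names are made up for illustration. The field
names (spec.job.state, spec.restartNonce) are the ones I would expect from
the operator's custom resource, so please check them against the CRD
reference for your operator version.

```python
import json


def suspend_patch() -> dict:
    """Merge patch that suspends the job while keeping the resource."""
    return {"spec": {"job": {"state": "suspended"}}}


def resume_patch() -> dict:
    """Merge patch that sets the job back to running."""
    return {"spec": {"job": {"state": "running"}}}


def restart_patch(current_nonce: int) -> dict:
    """Merge patch that changes restartNonce to request a restart.

    current_nonce is whatever value the spec holds today (assumed 0 if unset).
    """
    return {"spec": {"restartNonce": current_nonce + 1}}


if __name__ == "__main__":
    # Print each patch body as JSON so it can be passed to a patch command.
    for patch in (suspend_patch(), resume_patch(), restart_patch(0)):
        print(json.dumps(patch))
```

Each printed body could then be applied with something like
kubectl patch flinkdeployment <name> --type merge -p '<json>'. Because the
resource is never deleted, its status (and with it the last known state)
stays in place.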

Cheers
Gyula

On Wed, 19 Jul 2023 at 22:19, Tony Chen <tony.ch...@robinhood.com> wrote:

> Hi Gyula,
>
> Thank you for responding so quickly. I went through the page you sent me a
> bit more, and I see the following (
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.4/docs/custom-resource/job-management/#running-suspending-and-deleting-applications
> ):
>
>> Deleting a deployment will remove all checkpoint and status information.
>> Future deployments will [start] from an empty state unless manually
>> overridden by the user.
>>
>
> For our use case, we do delete the deployment and redeploy the Flink
> application sometimes in order to restart our Flink applications. We were
> wondering if it's possible for the operator to retain checkpoint and status
> information even after the deployment gets deleted.
>
> Thanks,
> Tony
>
> On Wed, Jul 19, 2023 at 3:46 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hey Tony,
>>
>> Please see:
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>
>> The operator is made especially to handle stateful application upgrades
>> robustly. In general, any spec change that leads to an upgrade will be
>> executed using the latest available checkpoint or savepoint. This is
>> controlled by the `upgradeMode` setting for jobs: as long as it is
>> last-state or savepoint, you will always get the latest state.
>>
>> This is somewhat orthogonal to the savepoint trigger /
>> initialSavepointPath mechanisms. The initialSavepointPath should be used
>> only the first time the deployment is created, because at that point the
>> operator is not aware of the latest state. After that, all upgrades
>> use the latest state unless the upgradeMode is stateless, in which case no
>> state is used. Savepoint triggering can help you keep backups for failure
>> recovery, but savepoints should not be triggered as part of your upgrade
>> flow because the operator already does this for you.
>>
>> Cheers,
>> Gyula
>>
>> On Wed, Jul 19, 2023 at 8:20 PM Tony Chen <tony.ch...@robinhood.com>
>> wrote:
>>
>>> Hi Flink Community,
>>>
>>> My name is Tony Chen, and I am a software engineer at Robinhood. I have
>>> some questions on restarting a Flink Application from a savepoint or
>>> checkpoint.
>>>
>>> We currently store our checkpoints and savepoints in S3, and we would
>>> like to use the Apache Flink Kubernetes Operator to manage our Flink
>>> applications. I know that there is a field called "initialSavepointPath" (
>>> doc
>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery>)
>>> that I can set in my kubernetes manifest so that whenever I want my Flink
>>> application to start from a particular savepoint, it will start from
>>> the savepoint directory in this field. However, if I delete this
>>> FlinkDeployment resource altogether after new savepoints were triggered,
>>> and then redeploy this FlinkDeployment resource, it looks like I have to
>>> manually update the "initialSavepointPath" to a newer savepoint directory
>>> so that the Flink application starts from a newer savepoint.
>>>
>>> Is there a way for us to redeploy FlinkDeployment resources so that the
>>> latest checkpoint or savepoint is used, and without having to update the
>>> "initialSavepointPath" field? I noticed in my testing that whenever I
>>> deleted the FlinkDeployment resource and redeployed, it would either start
>>> from the savepoint in initialSavepointPath or from checkpoint 1 if
>>> initialSavepointPath was not set.
>>>
>>> For example, let's say I restarted a Flink application at savepoint 10
>>> with initialSavepointPath set to s3://savepoints/savepoint-10, and then
>>> later on a savepoint 20 was completed and stored at
>>> s3://savepoints/savepoint-20. Is there a way for me to delete this
>>> FlinkDeployment and redeploy it without updating initialSavepointPath?
>>>
>>> Thanks,
>>> Tony
>>>
>>> P.S. I'm going through the source code more for Apache Flink Kubernetes
>>> Operator to understand how the operator starts a Flink job. Some relevant
>>> code:
>>>
>>>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L500
>>>    - https://github.com/apache/flink-kubernetes-operator/blob/0c341ebe13645f4e9802cfd780c5b50f59e29363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SavepointObserver.java#L204
>>>
>>>
>>> --
>>>
>>> <http://www.robinhood.com/>
>>>
>>> Tony Chen
>>>
>>> Software Engineer
>>>
>>> Menlo Park, CA
>>>
>>> Don't copy, share, or use this email without permission. If you received
>>> it by accident, please let us know and then delete it right away.
>>>
>>
>
