Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

Gyula Fóra Thu, 05 Oct 2023 12:46:09 -0700

Hi Tony!

There are still a few corner cases in which the operator cannot upgrade or
roll back deployments because the HA metadata (and with it the checkpoint
information) has been lost.


Most of these issues are not caused by the operator logic itself but by how
Flink handles certain failures. They are related to:

https://issues.apache.org/jira/browse/FLINK-30444 and
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore

Rollbacks are designed to allow automatic fallback to the last stable spec,
but the mechanism doesn't work in these corner cases (just as spec upgrades
don't).

I hope this helps to understand the problem.
The solution in these cases is to manually recover the job from the last
checkpoint/savepoint.
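
For illustration only, here is a rough Python sketch of preparing a manifest
for that kind of manual restore. It assumes you delete the broken
FlinkDeployment and re-create it with job.initialSavepointPath pointing at the
last checkpoint/savepoint you located by hand. The helper name
prepare_manual_restore, the image tag and the s3a:// path are hypothetical
placeholders, not something the operator provides.

```python
import copy
import json

# Metadata fields filled in by the API server; they should not be sent back
# when the resource is re-created from scratch.
SERVER_SIDE_METADATA = (
    "resourceVersion", "uid", "generation", "creationTimestamp", "managedFields",
)


def prepare_manual_restore(manifest: dict, restore_path: str) -> dict:
    """Return a copy of a FlinkDeployment manifest that starts from restore_path.

    The input manifest is left untouched.
    """
    restored = copy.deepcopy(manifest)
    # Status belongs to the old (broken) resource and must not be re-applied.
    restored.pop("status", None)
    metadata = restored.setdefault("metadata", {})
    for field in SERVER_SIDE_METADATA:
        metadata.pop(field, None)
    # Point the job at the checkpoint/savepoint that was located manually.
    job = restored.setdefault("spec", {}).setdefault("job", {})
    job["initialSavepointPath"] = restore_path
    return restored


if __name__ == "__main__":
    # Trimmed, hypothetical manifest; a real one carries more fields.
    broken = {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": "flink-testing-service", "resourceVersion": "12345"},
        "spec": {
            "image": "example/flink-job:healthy",  # hypothetical image tag
            "job": {"entryClass": "com.robinhood.flink.chaos.StreamingSumByKeyJob"},
        },
        "status": {"jobManagerDeploymentStatus": "ERROR"},
    }
    # Hypothetical checkpoint location, for illustration only.
    fixed = prepare_manual_restore(broken, "s3a://example-bucket/checkpoints/chk-42")
    print(json.dumps(fixed, indent=2))
```

The JSON it prints can be saved to a file and applied with kubectl once the old
resource is fully gone. Please check the result against your own spec before
relying on it.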

Cheers,
Gyula


On Thu, Oct 5, 2023 at 7:56 PM Tony Chen <tony.ch...@robinhood.com> wrote:

> I tried this out with operator version 1.4 and it didn't work for me. I
> noticed that when I was deploying a bad version, the Kubernetes HA metadata
> and configmaps were deleted:
>
> 2023-10-05 14:52:17,493 o.a.f.k.o.l.AuditUtils [INFO
> ][flink-testing-service/flink-testing-service] >>> Event | Info |
> SPECCHANGED | UPGRADE change(s) detected
> (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo,job.initialSavepointPath=s3a://robinhood-prod-flink/flink-testing-service/savepoints/savepoint-b832ef-05b185cb5800]
> differs from
> FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob,job.initialSavepointPath=<null>]),
> starting reconciliation.
> ...
> 2023-10-05 14:52:51,054 o.a.f.k.o.s.AbstractFlinkService
> [INFO ][flink-testing-service/flink-testing-service] Cluster shutdown
> completed.
> 2023-10-05 14:52:51,054 o.a.f.k.o.s.AbstractFlinkService
> [INFO ][flink-testing-service/flink-testing-service] Deleting
> Kubernetes HA metadata
> 2023-10-05 14:52:51,196 o.a.f.k.o.l.AuditUtils [INFO
> ][flink-testing-service/flink-testing-service] >>> Status | Info |
> UPGRADING | The resource is being upgraded
>
>
>
> Eventually, the rollback fails because the HA metadata is missing:
>
> 2023-10-05 14:58:16,119 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [WARN
> ][flink-testing-service/flink-testing-service] Rollback is not possible due
> to missing HA metadata
>
>
>
> Besides setting kubernetes.operator.deployment.rollback.enabled: true, is
> there anything else that I need to configure?
>
> On Thu, Oct 5, 2023 at 10:35 AM Tony Chen <tony.ch...@robinhood.com>
> wrote:
>
>> I just saw this experimental feature in the documentation:
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental
>>
>> I'm guessing this is the only way to automate rollbacks for now.
>>
>> On Wed, Oct 4, 2023 at 3:25 PM Tony Chen <tony.ch...@robinhood.com>
>> wrote:
>>
>>> Hi Flink Community,
>>>
>>> I am currently running Apache flink-kubernetes-operator on our
>>> kubernetes clusters, and I have Flink applications that are deployed using
>>> the FlinkDeployment Custom Resources (CR). I am trying to automate the
>>> process of rollbacks and I am running into some issues.
>>>
>>> I was testing out a bad deployment where the jobmanager never becomes
>>> healthy. I simulated this bad deployment by creating a Flink image with a
>>> bug in it. I see in the operator logs that the jobmanager is unhealthy:
>>>
>>> 2023-10-02 22:14:34,874 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO
>>> ][flink-testing-service/flink-testing-service] UPGRADE change(s) detected
>>> (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo]
>>> differs from
>>> FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob]),
>>> starting reconciliation.
>>> ...
>>> 2023-10-02 22:15:09,001 o.a.f.k.o.l.AuditUtils
>>> [INFO ][flink-testing-service/flink-testing-service] >>> Status | Info
>>> | UPGRADING | The resource is being upgraded
>>> ...
>>> 2023-10-02 22:17:23,911 o.a.f.k.o.l.AuditUtils
>>> [INFO ][flink-testing-service/flink-testing-service] >>> Status | Error
>>> | DEPLOYED |
>>> {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"back-off
>>> 20s restarting failed container=flink-main-container
>>> pod=flink-testing-service-749dd97c75-4w9ps_flink-testing-service(6db1adb3-4ca4-4924-a8c3-57a417818d85)","additionalMetadata":{"reason":"CrashLoopBackOff"},"throwableList":[]}
>>>
>>> ...
>>> 2023-10-02 22:17:33,576 o.a.f.k.o.o.d.ApplicationObserver
>>> [INFO ][flink-testing-service/flink-testing-service] Observing
>>> JobManager deployment. Previous status: ERROR
>>>
>>>
>>> Next, I change the spec of the FlinkDeployment so that it uses a healthy
>>> Flink image. The operator shows that the spec has changed:
>>>
>>> 2023-10-02 22:45:37,445 o.a.f.k.o.l.AuditUtils
>>> [INFO ][flink-testing-service/flink-testing-service] >>> Event | Info |
>>> SPECCHANGED | UPGRADE change(s) detected
>>> (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-329a14-2f8264206b1d]
>>> differs from
>>> FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-dc1077-134923759e30]),
>>> starting reconciliation.
>>>
>>>
>>> However, the Flink operator cannot reconcile this spec change, and the
>>> jobmanager is now permanently failing because it's still running the bad
>>> Flink image:
>>>
>>> 2023-10-02 22:45:37,461 o.a.f.k.o.l.AuditUtils
>>> [INFO ][flink-testing-service/flink-testing-service] >>> Event |
>>> Warning | UPGRADEFAILED | JobManager deployment is missing and HA data is
>>> not available to make stateful upgrades. It is possible that the job has
>>> finished or terminally failed, or the configmaps have been deleted. Manual
>>> restore required.
>>>
>>> I can simply delete this FlinkDeployment and redeploy with the healthy
>>> Flink image, but I would like to avoid manual restores if possible. Is it
>>> possible to recover by just changing the FlinkDeployment spec?
>>>
>>> Thanks,
>>> Tony
>>>
>>> --
>>>
>>> <http://www.robinhood.com/>
>>>
>>> Tony Chen
>>>
>>> Software Engineer
>>>
>>> Menlo Park, CA
>>>
>>> Don't copy, share, or use this email without permission. If you received
>>> it by accident, please let us know and then delete it right away.
>>>
>>
>
