Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-05-01 Thread Alan Zhang
Thanks for answering my questions, Gyula! Your insights are very helpful.
Let me take a deeper look at the existing logic and think it over.

-- 
Thanks,
Alan


Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-04-30 Thread Gyula Fóra
Application mode does indeed have a sticky jobId (at least when we are
performing a last-state upgrade; otherwise a new jobId is generated during
stateless deployments). But that's only part of the story, and arguably the
less important bit. The last-state upgrade mechanism for running/failing
(but otherwise non-terminal) jobs relies on the Flink HA metadata to carry
over the state information automagically. In Flink, the HA mechanism always
keeps track of the last state of a job so that even in the case of a JM
loss the job can recover correctly.

The operator's last-state upgrade uses this exact mechanism: we delete the
deployment (JMs and TMs) but keep the HA metadata, and then start the new
cluster with the upgraded spec. The JM will recover thinking that it's only
a failover and pick up the state automatically. We can do this because we
have 1 cluster - 1 job, and upgrading means upgrading the entire deployment.
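
As a rough illustration of the delete / keep HA metadata / redeploy flow
described above, here is a toy in-memory model. It is only a sketch and not
the operator's code: HaStore, Cluster, start_cluster and last_state_upgrade
are hypothetical names made up for this example, and the checkpoint path is
invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HaStore:
    """Hypothetical stand-in for the HA metadata."""
    # job id -> pointer to the latest completed checkpoint
    checkpoint_pointers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Cluster:
    """Hypothetical stand-in for one application cluster running one job."""
    spec_version: str
    job_id: str
    restored_from: Optional[str] = None


def start_cluster(spec_version: str, job_id: str, ha: HaStore) -> Cluster:
    # On startup the JM looks up its job id in the HA metadata. If a pointer
    # exists, it restores from it, just as it would after a failover.
    return Cluster(spec_version, job_id, ha.checkpoint_pointers.get(job_id))


def last_state_upgrade(old: Cluster, new_spec_version: str, ha: HaStore) -> Cluster:
    # 1. Delete the deployment: the old Cluster object is simply dropped.
    # 2. Keep the HA metadata: `ha` is left untouched.
    # 3. Start the new cluster with the upgraded spec and the same job id.
    return start_cluster(new_spec_version, old.job_id, ha)


ha = HaStore()
v1 = start_cluster("v1", "job-a", ha)
# A checkpoint completes while v1 is running (made-up path).
ha.checkpoint_pointers["job-a"] = "s3://bucket/checkpoints/chk-42"
v2 = last_state_upgrade(v1, "v2", ha)
print(v2.spec_version, v2.restored_from)
```

The point of the model is that the new cluster never has to be told where the
state is: the same job id plus the surviving HA metadata is enough.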

The same is not true for session jobs, where we can't use the HA metadata
trick and actually need to figure out the last state (the checkpoint or
savepoint path). This can only be done through the JM REST API. This should
be possible in most cases when the JM is healthy after cancelling the
session job. By the way, for terminal jobs (FAILED/FINISHED/CANCELLED) we
do something similar for FlinkDeployments, where the last checkpoint info
is queried from the JM:
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SnapshotObserver.java#L74-L78

For session jobs you will not need sticky job ids because they are simply
not relevant.
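
To make the REST lookup mentioned above more concrete, here is a minimal
sketch of finding the last checkpoint or savepoint path for a job. It is not
how the operator does it. It assumes the JM serves checkpoint statistics at
/jobs/<jobid>/checkpoints and that the response has a "latest" object whose
"completed" and "savepoint" entries carry "id" and "external_path" fields;
the helper names and the sample payload are made up for illustration.

```python
import json
from typing import Optional
from urllib.request import urlopen


def fetch_checkpoint_stats(rest_url: str, job_id: str) -> dict:
    """Fetch checkpoint statistics for one job from the JM REST API."""
    # Assumed endpoint; this fails if the JM is unreachable, which is exactly
    # the case where the last state cannot be determined this way.
    with urlopen(f"{rest_url}/jobs/{job_id}/checkpoints", timeout=10) as resp:
        return json.load(resp)


def latest_snapshot_path(stats: dict) -> Optional[str]:
    """Return the path of the newest completed checkpoint or savepoint."""
    latest = stats.get("latest") or {}
    candidates = [latest.get("completed"), latest.get("savepoint")]
    # Keep only entries that exist and actually have a path.
    candidates = [c for c in candidates if c and c.get("external_path")]
    if not candidates:
        return None
    # Assumption: a higher id means a more recent snapshot.
    newest = max(candidates, key=lambda c: c.get("id", -1))
    return newest["external_path"]


# Offline usage with a hand-written sample payload (no cluster needed).
sample = {
    "latest": {
        "completed": {"id": 7, "external_path": "s3://bucket/chk-7"},
        "savepoint": {"id": 5, "external_path": "s3://bucket/savepoint-5"},
    }
}
print(latest_snapshot_path(sample))
```

Against a live session cluster the two helpers would be chained, e.g.
latest_snapshot_path(fetch_checkpoint_stats(rest_url, job_id)), where
rest_url and job_id are whatever applies to your setup.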

Gyula


Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-04-30 Thread Alan Zhang
Hi Gyula,

Thanks for your reply! Good suggestion on the JIRA ticket; I created one to
track this: https://issues.apache.org/jira/browse/FLINK-35279
We may be interested in working on it because of our own requirements. I
will check back with you and the community once we have some updates.

> We don't have the same robust way of getting the last-state information
> for session jobs as we do for applications, so it will be slightly less
> reliable overall.

My understanding is that application mode has a sticky job id but session
mode doesn't, and that a sticky job id makes it easier to implement the
"last-state" upgrade mode. When you said "robust way", did you mean the
"sticky job id" in application mode?


-- 
Thanks,
Alan


Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-04-29 Thread Gyula Fóra
Hi Alan!

I think it should be possible to address this gap for most cases. We don't
have the same robust way of getting the last-state information for session
jobs as we do for applications, so it will be slightly less reliable
overall.
For session jobs the last checkpoint info has to be queried from the JM
REST API, so as long as that is available it should work fine.

I am not aware of anyone working on this at the moment; it would be great
if you could open a JIRA ticket to track this. If you are interested in
working on this, we can also support you, but this is a fairly complex
feature that involves many layers of operator logic.

Cheers,
Gyula


[Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-04-29 Thread Alan Zhang
Hi,

We want to use the Apache Flink Kubernetes operator to manage the
lifecycle of our Flink jobs in Flink session clusters, and we would like to
use the "last-state" upgrade feature for our use cases.

However, the latest official doc states that the "last-state" upgrade mode
is currently not supported in session mode (a.k.a. FlinkSessionJob):
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades

> Last state upgrade mode is currently only supported for FlinkDeployments.

Why isn't this upgrade mode supported in session mode? Is there a plan to
address this gap? Any suggestions for us if we want to stick with session
mode?

-- 
Thanks,
Alan