Re: Recover from savepoints with Kubernetes HA

2021-07-23 Thread Austin Cawley-Edwards
Great, glad it was an easy fix :) Thanks for following up!

On Fri, Jul 23, 2021 at 3:54 AM Thms Hmm  wrote:

> Finally I found the mistake: I passed "--host 10.1.2.3" as a single
> argument. I think the savepoint argument was therefore not interpreted
> correctly, or was ignored. It may be that "-s" was consumed as the value
> for "--host 10.1.2.3" and "s3p://…" was treated as a new parameter; since
> neither is a valid argument, both were ignored.
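A minimal sketch in plain shell (illustrative only, not Flink's actual option parser) of the argv difference Thomas describes: quoting fuses the flag and its value into one element, which changes what a parser sees.

```shell
# Count how many argv elements a command receives
count_args() { echo $#; }

# Broken: "--host 10.1.2.3" arrives as ONE argv element (3 elements total),
# so a parser sees a single unknown token and may misread what follows
broken=$(count_args "--host 10.1.2.3" -s s3p://bucket/job1/savepoints/savepoint-00-1234)

# Working: flag and value are separate elements (4 elements total),
# matching the second log excerpt below
working=$(count_args --host 10.1.2.3 -s s3p://bucket/job1/savepoints/savepoint-00-1234)

echo "broken=$broken working=$working"   # broken=3 working=4
```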
>
> Not working:
>
> 23.07.2021 09:19:54.546 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  Program Arguments:
>
> ...
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 10.1.2.3
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
> s3p://bucket/job1/savepoints/savepoint-00-1234
>
> -
>
> Working:
>
> 23.07.2021 09:19:54.546 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  Program Arguments:
>
> ...
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 10.1.2.3
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
>
> 23.07.2021 09:19:54.549 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
> s3p://bucket/job1/savepoints/savepoint-00-1234
>
> ...
>
> 23.07.2021 09:37:12.932 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job
>  from savepoint
> s3p://bucket/job1/savepoints/savepoint-00-1234 ()
>
> Thanks again for your help.
>
> Kr Thomas
>
> Yang Wang  wrote on Fri, Jul 23, 2021 at 04:34:
>
>> Please note that when the job is canceled, the HA data (including the
>> checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
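As a hedged illustration of the cleanup Yang describes (namespace and cluster-id below are placeholders, not values from this thread), the HA metadata can be inspected before and after cancellation:

```shell
# Sketch: list the ConfigMaps Flink's Kubernetes HA services create.
# NS and CLUSTER_ID are placeholders for your deployment.
NS=flink
CLUSTER_ID=job1

if command -v kubectl >/dev/null 2>&1; then
  # Leader election data and checkpoint pointers live in ConfigMaps whose
  # names include the cluster-id; after cancel-with-savepoint they are gone
  kubectl -n "$NS" get configmaps 2>/dev/null | grep "$CLUSTER_ID" \
    || echo "no HA ConfigMaps for $CLUSTER_ID"
fi
```

Once those ConfigMaps are deleted, a restarted JobManager has no checkpoint pointer to recover from, which is why the `-s`/`--fromSavepoint` argument must supply the starting point.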
>>
>> But it is strange that the "-s/--fromSavepoint" does not take effect when
>> redeploying the Flink application. The JobManager logs could help a lot to
>> find the root cause.
>>
>> Best,
>> Yang
>>
>> Austin Cawley-Edwards  wrote on Thu, Jul 22, 2021 at 11:09 PM:
>>
>>> Hey Thomas,
>>>
>>> Hmm, I see no reason why you should not be able to update the checkpoint
>>> interval at runtime, and don't believe that information is stored in a
>>> savepoint. Can you share the JobManager logs of the job where this is
>>> ignored?
>>>
>>> Thanks,
>>> Austin
>>>
>>> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm  wrote:
>>>
 Hey Austin,

 Thanks for your help.

 I tried to change the checkpoint interval as an example. Its value comes
 from an additional config file and is read and set within the job's main().

 The job is running in Application mode. It is basically the same
 configuration as on the official Flink website, except that the JobManager
 is created as a Deployment instead of being run as a Job.

 For redeployment, the REST API is used to trigger a savepoint and cancel
 the job. After completion, the Deployment is updated and the pods are
 recreated. The -s argument is always passed when starting the JobManager
 (standalone-job.sh); the CLI is not involved. We have automated these
 steps, but I also tried them manually with the same results.
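A minimal sketch of that automated flow, under stated assumptions: the host, job id, and target directory are placeholders, and the endpoints are Flink's REST API for triggering and polling savepoints.

```shell
# Helper: pull "request-id" from the JSON response without requiring jq
extract_request_id() { sed -n 's/.*"request-id" *: *"\([^"]*\)".*/\1/p'; }

# Hedged sketch of the described upgrade flow; defined only, not run here
# against a real cluster.
upgrade_job() {
  flink="$1"; job_id="$2"; target_dir="$3"

  # 1. Trigger a savepoint and cancel the job in one call
  trigger_id=$(curl -s -X POST "$flink/jobs/$job_id/savepoints" \
      -H 'Content-Type: application/json' \
      -d "{\"target-directory\": \"$target_dir\", \"cancel-job\": true}" \
    | extract_request_id)

  # 2. Poll the async operation until it reports COMPLETED
  curl -s "$flink/jobs/$job_id/savepoints/$trigger_id"

  # 3. Update the Deployment so the JobManager starts with
  #    "-s <new savepoint path>" as two separate arguments, then roll the pods
}

# Example (hypothetical values):
# upgrade_job http://jobmanager:8081 "$JOB_ID" s3p://bucket/job1/savepoints
```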

 I also tried to trigger a savepoint, scale the pods down, update the
 start parameter with the recent savepoint, and rename
 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.

 When I trigger a savepoint with cancel, I also see that the HA config
 maps are cleaned up.


 Kr Thomas

 Austin Cawley-Edwards  wrote on Wed, Jul 21, 2021 at 16:52:

> Hi Thomas,
>
> I've got a few questions that will hopefully help find an answer:
>
> What job properties are you trying to change? Something like
> parallelism?
>
> What mode is your job running in? i.e., Session, Per-Job, or
> Application?
>
> Can you also describe how you're redeploying the job? Are you using
> the Native Kubernetes integration or Standalone (i.e., writing k8s manifest
> files yourself)? It sounds like you are also using the Flink CLI, is
> that correct?
>
> Thanks,
> Austin
>
> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm  wrote:
>
>> Hey,
>>
>> we have some application clusters running on Kubernetes and are exploring
>> the HA mode, which is working as expected. When we try to upgrade a job,
>> e.g. trigger a savepoint, cancel the job, and redeploy, Flink does not
>> restart from the savepoint we provide via the -s parameter, so all state
>> is lost.
>>
>> If we just trigger the savepoint without canceling the job and redeploy,
>> the HA mode picks up from the latest savepoint.
>>
>> But this way we cannot upgrade job properties, as they seem to be picked
>> up from the savepoint.
>>
>> Is there any advice on how to do upgrades with HA enabled?
>>
>> Flink version is 1.12.2.
>>
>> Thanks for your help.
>>
>> Kr thomas
