[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2021-04-29 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336157#comment-17336157
 ] 

Flink Jira Bot commented on FLINK-18263:


This issue was labeled "stale-major" 7 days ago and has not received any updates so 
it is being deprioritized. If this ticket is actually Major, please raise the 
priority and ask a committer to assign you the issue or revive the public 
discussion.


> Allow external checkpoints to be persisted even when the job is in "Finished" 
> state.
> 
>
> Key: FLINK-18263
> URL: https://issues.apache.org/jira/browse/FLINK-18263
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Mark Cho
>Priority: Major
>  Labels: pull-request-available, stale-major
>
> Currently, `execution.checkpointing.externalized-checkpoint-retention` 
> configuration supports two options:
> - `DELETE_ON_CANCELLATION` which keeps the externalized checkpoints in FAILED 
> and SUSPENDED state.
> - `RETAIN_ON_CANCELLATION` which keeps the externalized checkpoints in 
> FAILED, SUSPENDED, and CANCELED state.
> This gives us control over the retention of externalized checkpoints in all 
> terminal states of a job, except for the FINISHED state.
> If the job ends up in "FINISHED" state, externalized checkpoints will be 
> automatically cleaned up and there currently is no config that will ensure 
> that these externalized checkpoints are persisted.
> I found an old Jira ticket FLINK-4512 where this was discussed. I think it 
> would be helpful to have a config that can control the retention policy for 
> FINISHED state as well.
> - This can be useful for cases where we want to rewind a job (that reached 
> the FINISHED state) to a previous checkpoint.
> - When we use externalized checkpoints, we want to fully delegate the 
> checkpoint clean-up to an external process in all job states (without 
> cherrypicking FINISHED state to be cleaned up by Flink).
> We have a quick fix working in our fork where we've changed 
> `ExternalizedCheckpointCleanup` enum:
> {code:java}
> RETAIN_ON_FAILURE (renamed from DELETE_ON_CANCELLATION; retains on FAILED)
> RETAIN_ON_CANCELLATION (kept the same; retains on FAILED, CANCELED)
> RETAIN_ON_SUCCESS (added; retains on FAILED, CANCELED, FINISHED)
> {code}
> Since this change requires changes to multiple components (e.g. config 
> values, REST API, Web UI, etc), I wanted to get the community's thoughts 
> before I invest more time in my quick fix PR (which currently only contains 
> minimal change to get this working).
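
The three enum values listed in the description above can be sketched roughly as follows. This is only an illustration of the retention semantics as the description states them, not the code from the fork or from Flink: `TerminalState`, `ProposedCleanup`, and `retains()` are hypothetical names, and SUSPENDED is left out because the proposed listing does not mention it.

```java
import java.util.EnumSet;
import java.util.Set;

/** Local stand-in for the terminal states named in the ticket (not Flink's own job status type). */
enum TerminalState { FAILED, CANCELED, FINISHED }

/** Sketch of the three proposed values; the field and method names are hypothetical. */
enum ProposedCleanup {
    RETAIN_ON_FAILURE(EnumSet.of(TerminalState.FAILED)),
    RETAIN_ON_CANCELLATION(EnumSet.of(TerminalState.FAILED, TerminalState.CANCELED)),
    RETAIN_ON_SUCCESS(EnumSet.allOf(TerminalState.class));

    private final Set<TerminalState> retainedStates;

    ProposedCleanup(Set<TerminalState> retainedStates) {
        this.retainedStates = retainedStates;
    }

    /** Returns true if externalized checkpoints would be kept when the job ends in the given state. */
    boolean retains(TerminalState state) {
        return retainedStates.contains(state);
    }
}

class ProposedCleanupDemo {
    public static void main(String[] args) {
        // Print the full policy/state matrix.
        for (ProposedCleanup policy : ProposedCleanup.values()) {
            for (TerminalState state : TerminalState.values()) {
                System.out.println(policy + " / " + state + " -> retain=" + policy.retains(state));
            }
        }
    }
}
```

Each value retains on a superset of the states of the previous one, which is why the description presents them as an extension of the existing two options.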



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732
 ] 

Flink Jira Bot commented on FLINK-18263:


This major issue is unassigned and itself and all of its Sub-Tasks have not 
been updated for 30 days. So, it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update. Afterwards, 
please remove the label. In 7 days the issue will be deprioritized.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2021-01-17 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266725#comment-17266725
 ] 

Congxian Qiu commented on FLINK-18263:
--

There seems to be a related [mailing list 
thread|http://apache-flink.147419.n8.nabble.com/Flink-checkpoint-td10186.html] 
for this issue.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-09-25 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201960#comment-17201960
 ] 

Yun Tang commented on FLINK-18263:
--

If so, [~markcho], would you consider continuing to update your PR? I can 
assign this ticket to you if you agree.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-09-22 Thread Mark Cho (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200192#comment-17200192
 ] 

Mark Cho commented on FLINK-18263:
--

Thanks for following up [~yunta]. That sounds like a great option for us.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-09-15 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195961#comment-17195961
 ] 

Yun Tang commented on FLINK-18263:
--

After the State Processor API was introduced and FLINK-14942 was resolved, I 
think we could offer users more freedom to work with retained checkpoints, 
whether for bootstrapping or for other purposes. I think we could introduce 
another new CheckpointRetentionPolicy named {{ALWAYS_RETAIN_AFTER_TERMINATION}} 
for expert users and keep most of the code the same to avoid a big change.
Moreover, we should warn users that the {{ALWAYS_RETAIN_AFTER_TERMINATION}} 
retention policy could leave more storage space in use, so it must be used 
cautiously and together with timely external clean-up.

What do you think of this suggestion? [~markcho]
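
As a rough illustration of the "external clean up" mentioned above, the following sketch lists checkpoint directories older than a chosen age. It assumes the usual layout in which each job's checkpoint directory contains `chk-<n>` subdirectories. The class name, method name, and 7-day default are hypothetical, and the sketch only reports candidates without deleting anything, since with incremental checkpoints a `chk-<n>` directory may reference shared files stored elsewhere.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for an external clean-up process; not part of Flink. */
class RetainedCheckpointSweeper {
    /**
     * Lists "chk-*" directories under jobCheckpointDir whose last-modified time
     * is older than maxAge relative to now. It only reports candidates.
     */
    static List<Path> findExpired(Path jobCheckpointDir, Duration maxAge, Instant now)
            throws IOException {
        Instant cutoff = now.minus(maxAge);
        List<Path> expired = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(jobCheckpointDir, "chk-*")) {
            for (Path dir : dirs) {
                if (Files.isDirectory(dir)
                        && Files.getLastModifiedTime(dir).toInstant().isBefore(cutoff)) {
                    expired.add(dir);
                }
            }
        }
        return expired;
    }

    public static void main(String[] args) throws IOException {
        // The directory to scan is taken from the first argument (default: current directory).
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        findExpired(dir, Duration.ofDays(7), Instant.now()).forEach(System.out::println);
    }
}
```

An external controller could run something like this periodically and decide per job which of the reported directories are safe to remove.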



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-06-18 Thread Mark Cho (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139250#comment-17139250
 ] 

Mark Cho commented on FLINK-18263:
--

{code:java}
Another case it would help is to move a running job if users do not want to 
take a savepoint but just want to reuse the periodical external checkpoints.
{code}
That is what we typically do and what we want to do for this case as well. 
However, in this case, there are no external checkpoints that we can use to 
redeploy, because the "FINISHED" state causes the JM to delete all the external 
checkpoints.

Returning true from `isEndOfStream(...)` in KafkaDeserializationSchema puts the 
Kafka source into a "FINISHED" state. In this specific case, the job is put 
into "FINISHED" state because the current deployment (with its current config) 
does not want to process those records, so we have an external controller that 
will redeploy this job with a changed config that can process those records.

In this redeployment, we would like to deploy from the checkpoint so that we 
resume from the offsets stored in the checkpoint and also restore the other 
state that is in the checkpoint.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-06-18 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139200#comment-17139200
 ] 

Zhu Zhu commented on FLINK-18263:
-

From 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#retained-checkpoints, 
I think it states that "Checkpoints are by default not retained and are only 
used to resume a job from failures".
RETAIN_ON_CANCELLATION can help in the case where a job is continuously failing 
and users want to manually cancel it before the job reaches the max failure 
limit. Another case it would help is to move a running job if users do not want 
to take a savepoint but just want to reuse the periodical external checkpoints.
So I feel that we need a valid case where `ALWAYS_RETAIN` is really needed 
before we change this public interface.
However, it's better to let the state experts [~liyu] [~yunta] make the 
decision.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-06-18 Thread Mark Cho (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139146#comment-17139146
 ] 

Mark Cho commented on FLINK-18263:
--

Hi [~yunta], [~zhuzh],

In our environment, we typically redeploy Flink jobs using the last available 
checkpoints (unless we know that the checkpoint is not compatible with the 
redeploy, in which case we use savepoints).

We always enable externalized checkpoints, so for a healthy, running job, we 
usually have n checkpoints where n is `state.checkpoints.num-retained`. 

We don't typically have jobs that end in the `FINISHED` state, so we never 
noticed this issue before. But since we enabled externalized checkpoints, we 
were not expecting the JM to delete all the checkpoints on a terminal state.

In this specific job, it was using a Kafka source and used the following method:
{code:java}
KafkaDeserializationSchema::isEndOfStream(T nextElement)
{code}
It has some custom logic to detect whether to "finish" the current task or not, 
and at some point, all Kafka source tasks hit `isEndOfStream == true` and the 
job requires redeployment with some changed configurations.

For this specific use case, we have different solutions that can address this 
job's requirements.

However, we thought the current configuration for ExternalizedCheckpointCleanup 
is a bit awkward. When we think about using externalized checkpoints, we would 
like to hand the clean-up of a job's checkpoints over to an external process. 
Having the JM manage the clean-up for some states (like "FINISHED") but not for 
other states ("CANCELED", "FAILED") seems strange, given that there currently 
isn't a way to include the "FINISHED" state in the "retain checkpoint" list.

[~zhuzh]'s suggestion on `NEVER_DELETE` or `ALWAYS_RETAIN` is exactly what we 
would like. My initial thoughts were that since there is already a config for 
retaining on "FAILED" and "FAILED/CANCELED", extending the config to include 
"FAILED/CANCELED/FINISHED" would be the cleanest way to achieve this, but 
having the config be "Don't Retain" and "Always Retain" is exactly what we 
would like to see.
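
For readers unfamiliar with the end-of-stream mechanism described above, here is a minimal, self-contained sketch of the idea. It does not use Flink: `EndOfStreamCheck` is a stand-in for the single method discussed, `BoundedSourceSketch` is a hypothetical class, and the `v2:` prefix is an invented marker for "records the current deployment does not want to process".

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the one method discussed above; this is NOT Flink's KafkaDeserializationSchema. */
interface EndOfStreamCheck<T> {
    boolean isEndOfStream(T nextElement);
}

class BoundedSourceSketch {
    /**
     * Emits records until the check reports end of stream, mimicking a source
     * task that finishes itself. In this sketch the end-of-stream record is not emitted.
     */
    static <T> List<T> consumeUntilEnd(Iterable<T> records, EndOfStreamCheck<T> check) {
        List<T> emitted = new ArrayList<>();
        for (T record : records) {
            if (check.isEndOfStream(record)) {
                break; // the task stops here; once every source task stops, the job is FINISHED
            }
            emitted.add(record);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical marker: stop at the first record with the "v2:" prefix.
        EndOfStreamCheck<String> stopAtV2 = record -> record.startsWith("v2:");
        System.out.println(consumeUntilEnd(List.of("v1:a", "v1:b", "v2:c", "v1:d"), stopAtV2));
    }
}
```

The point of the sketch is that the job ends normally rather than failing or being canceled, which is exactly the terminal state for which no retention option exists.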



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-06-18 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139118#comment-17139118
 ] 

Zhu Zhu commented on FLINK-18263:
-

Instead of `RETAIN_ON_SUCCESS`, I think `NEVER_DELETE`/`ALWAYS_RETAIN` is 
clearer because it will retain external checkpoints even if the job is 
CANCELED/FINISHED.

But I still want to know the scenario in which you need to rewind a FINISHED 
job. [~markcho]
It can help us understand the necessity of always retaining an external 
checkpoint.



[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-06-11 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133854#comment-17133854
 ] 

Yun Tang commented on FLINK-18263:
--

I think this feature depends on how we define the ‘{{FINISHED}}’ job status. 
If all tasks are finished, why would we still need to keep that checkpoint, 
given that the job has already completed its life cycle? CC [~zjwang], [~zhuzh], 
as they might have more thoughts on the job status definition.

As you mentioned, we could rewind a job (that reached the FINISHED state) to a 
previous checkpoint if checkpoints were retained on FINISHED status. However, 
the time of the last checkpoint would not be very precise, so I don't know how 
much this would help; a manual savepoint might be more useful in your scenario.






[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2020-06-11 Thread Mark Cho (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133651#comment-17133651
 ] 

Mark Cho commented on FLINK-18263:
--

I added a PR for the proposed changes to the `ExternalizedCheckpointCleanup` 
enum and the `execution.checkpointing.externalized-checkpoint-retention` 
configuration.

The PR is still WIP: it will need changes to multiple components once we align 
on the proposed changes.



