[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336157#comment-17336157 ]

Flink Jira Bot commented on FLINK-18263:

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Allow external checkpoints to be persisted even when the job is in "Finished" state.
>
> Key: FLINK-18263
> URL: https://issues.apache.org/jira/browse/FLINK-18263
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Mark Cho
> Priority: Major
> Labels: pull-request-available, stale-major
>
> Currently, the `execution.checkpointing.externalized-checkpoint-retention` configuration supports two options:
> - `DELETE_ON_CANCELLATION`, which keeps the externalized checkpoints in the FAILED and SUSPENDED states.
> - `RETAIN_ON_CANCELLATION`, which keeps the externalized checkpoints in the FAILED, SUSPENDED, and CANCELED states.
> This gives us control over the retention of externalized checkpoints in all terminal states of a job, except for the FINISHED state.
> If the job ends up in the "FINISHED" state, externalized checkpoints are automatically cleaned up, and there is currently no config that ensures these externalized checkpoints are persisted.
> I found an old Jira ticket, FLINK-4512, where this was discussed. I think it would be helpful to have a config that can control the retention policy for the FINISHED state as well.
> - This can be useful for cases where we want to rewind a job (that reached the FINISHED state) to a previous checkpoint.
> - When we use externalized checkpoints, we want to fully delegate the checkpoint clean-up to an external process in all job states (without cherry-picking the FINISHED state to be cleaned up by Flink).
> We have a quick fix working in our fork where we've changed the `ExternalizedCheckpointCleanup` enum:
> {code:java}
> RETAIN_ON_FAILURE (renamed from DELETE_ON_CANCELLATION; retains on FAILED)
> RETAIN_ON_CANCELLATION (kept the same; retains on FAILED, CANCELED)
> RETAIN_ON_SUCCESS (added; retains on FAILED, CANCELED, FINISHED)
> {code}
> Since this change requires changes to multiple components (e.g. config values, REST API, Web UI, etc.), I wanted to get the community's thoughts before I invest more time in my quick fix PR (which currently only contains the minimal change to get this working).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
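The enum change described in the issue can be modelled as a small, self-contained Java sketch. It is not Flink's actual `ExternalizedCheckpointCleanup` class: `RetentionSketch`, `TerminalState`, `Cleanup`, and `retains` are hypothetical names chosen for illustration, and SUSPENDED is left out because the fork's description lists only FAILED, CANCELED, and FINISHED.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-alone model of the proposal; not Flink's actual classes.
class RetentionSketch {

    /** Terminal job states named in the fork's description. */
    enum TerminalState { FAILED, CANCELED, FINISHED }

    /** The three values from the fork, each with the states on which checkpoints are kept. */
    enum Cleanup {
        RETAIN_ON_FAILURE(EnumSet.of(TerminalState.FAILED)),
        RETAIN_ON_CANCELLATION(EnumSet.of(TerminalState.FAILED, TerminalState.CANCELED)),
        RETAIN_ON_SUCCESS(EnumSet.allOf(TerminalState.class));

        private final Set<TerminalState> retainedOn;

        Cleanup(Set<TerminalState> retainedOn) {
            this.retainedOn = retainedOn;
        }

        /** True if externalized checkpoints should survive a job ending in the given state. */
        boolean retains(TerminalState state) {
            return retainedOn.contains(state);
        }
    }

    public static void main(String[] args) {
        // Print the full retain/delete matrix for every value and terminal state.
        for (Cleanup c : Cleanup.values()) {
            for (TerminalState s : TerminalState.values()) {
                System.out.println(c + " / " + s + " -> " + (c.retains(s) ? "retain" : "delete"));
            }
        }
    }
}
```

Listing the retained states per value makes the gap in the current options easy to see: only the added `RETAIN_ON_SUCCESS` value covers FINISHED.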
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732 ]

Flink Jira Bot commented on FLINK-18263:

This major issue is unassigned and itself and all of its Sub-Tasks have not been updated for 30 days. So, it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266725#comment-17266725 ]

Congxian Qiu commented on FLINK-18263:

There seems to be a [mailing list thread|http://apache-flink.147419.n8.nabble.com/Flink-checkpoint-td10186.html] related to this issue.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201960#comment-17201960 ]

Yun Tang commented on FLINK-18263:

If so, [~markcho], would you consider continuing to modify your PR? I could assign this ticket to you if you agree.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200192#comment-17200192 ]

Mark Cho commented on FLINK-18263:

Thanks for following up [~yunta]. That sounds like a great option for us.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195961#comment-17195961 ]

Yun Tang commented on FLINK-18263:

Now that the State Processor API has been introduced and FLINK-14942 is resolved, I think we could offer users more freedom to work with retained checkpoints, whether for bootstrapping or for other purposes.

I think we could introduce another CheckpointRetentionPolicy named {{ALWAYS_RETAIN_AFTER_TERMINATION}} for expert users and keep most of the code the same to avoid a big change. Moreover, we should warn users that the {{ALWAYS_RETAIN_AFTER_TERMINATION}} retention policy could leave more storage in use, so it must be used with caution and paired with timely external clean-up.

What do you think of this suggestion, [~markcho]?
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139250#comment-17139250 ]

Mark Cho commented on FLINK-18263:

{code:java}
Another case it would help is to move a running job if users do not want to take a savepoint but just want to reuse the periodical external checkpoints.
{code}
That is what we typically do and what we want to do for this case as well. However, in this case, there are no external checkpoints that we can use to redeploy, because the "FINISHED" state causes the JM to delete all the external checkpoints.

Once `isEndOfStream(...)` in KafkaDeserializationSchema signals the end of the stream, the Kafka source goes into a "FINISHED" state. In this specific case, the job is put into the "FINISHED" state because the current deployment (with the current config) does not want to process those records, so we have an external controller that will redeploy this job with a changed config that can process those records. In this redeployment, we would like to deploy from the checkpoint so that we resume from the offsets stored in the checkpoint and also restore the other state in the checkpoint.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139200#comment-17139200 ]

Zhu Zhu commented on FLINK-18263:

From https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#retained-checkpoints, I think it is stated that "Checkpoints are by default not retained and are only used to resume a job from failures".

RETAIN_ON_CANCELLATION can help in the case where a job is continuously failing and users want to manually cancel it before the job reaches the max failure limit. Another case it would help is to move a running job if users do not want to take a savepoint but just want to reuse the periodical external checkpoints.

So I feel that we need a valid case where `ALWAYS_RETAIN` is really needed before we change this public interface. However, it's better to let the state experts [~liyu] and [~yunta] make the decision.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139146#comment-17139146 ]

Mark Cho commented on FLINK-18263:

Hi [~yunta], [~zhuzh],

In our environment, we typically redeploy Flink jobs using the last available checkpoints (unless we know that the checkpoint is not compatible with the redeploy, in which case we use savepoints). We always enable externalized checkpoints, so for a healthy, running job, we usually have n checkpoints, where n is `state.checkpoints.num-retained`.

We don't typically have jobs that end in the `FINISHED` state, so we never noticed this issue before. But since we enabled externalized checkpoints, we were not expecting the JM to delete all the checkpoints on a terminal state.

This specific job was using a Kafka source and used the following method:
{code:java}
KafkaDeserializationSchema::isEndOfStream(T nextElement){code}
It has some custom logic to detect whether to "finish" the current task or not. At some point, all Kafka source tasks hit `isEndOfStream == true`, and the job requires redeployment with some changed configuration.

For this specific use case, we have different solutions that can address this job's requirements. However, we thought the current configuration for ExternalizedCheckpointCleanup is a bit awkward. When we think about using externalized checkpoints, we would like to hand the clean-up of a job's checkpoints over to an external process. Having the JM manage the clean-up for some states (like "FINISHED") but not for others ("CANCELED", "FAILED") seems strange, given that there currently isn't a way to include the "FINISHED" state in the "retain checkpoint" list.

[~zhuzh]'s suggestion of `NEVER_DELETE` or `ALWAYS_RETAIN` is exactly what we would like. My initial thoughts were that since there is already a config for retaining on "FAILED" and "FAILED/CANCELED", extending the config to include "FAILED/CANCELED/FINISHED" would be the cleanest way to achieve this, but having the config be "Don't Retain" and "Always Retain" is exactly what we would like to see.
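To make the end-of-stream behaviour discussed in this comment concrete, here is a minimal stand-alone sketch. It does not use Flink: `EndOfStreamSketch`, `EndOfStreamCheck`, and `consume` are hypothetical names, the interface only mirrors the `isEndOfStream(T nextElement)` method named above, and the rule "stop at the first `v2` record" is an assumed placeholder for the job's custom logic.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-alone model; this is not Flink's KafkaDeserializationSchema.
class EndOfStreamSketch {

    /** Mirrors only the isEndOfStream(T nextElement) method named in the thread. */
    interface EndOfStreamCheck<T> {
        boolean isEndOfStream(T nextElement);
    }

    /** Consumes records until the check fires; returns how many were processed. */
    static <T> int consume(Iterator<T> records, EndOfStreamCheck<T> check) {
        int processed = 0;
        while (records.hasNext()) {
            T next = records.next();
            if (check.isEndOfStream(next)) {
                // The source finishes here; in this model the marker record is not processed.
                break;
            }
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("v1-a", "v1-b", "v2-a", "v2-b");
        // Assumed custom logic: stop at the first record this deployment cannot handle.
        int processed = consume(records.iterator(), r -> r.startsWith("v2"));
        System.out.println("processed " + processed + " records before finishing");
    }
}
```

In this model the remaining `v2` records stay unread, which corresponds to the situation in the comment: a redeployed job with a changed configuration would need the offsets from a retained checkpoint to resume from that point.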
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139118#comment-17139118 ]

Zhu Zhu commented on FLINK-18263:

Instead of `RETAIN_ON_SUCCESS`, I think `NEVER_DELETE`/`ALWAYS_RETAIN` is clearer because it will retain external checkpoints even if the job is CANCELED/FINISHED.

But I still want to know the scenario in which you need to rewind a FINISHED job, [~markcho]. It can help us understand the necessity of always retaining an external checkpoint.
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133854#comment-17133854 ] Yun Tang commented on FLINK-18263: -- I think this future depends on how we give definition for ‘{{FINISHED}}’ job status. If all tasks are finished, why we still need to keep that checkpoint as that job would already complete its life-cycle. CC [~zjwang], [~zhuzh] as they might give more thoughts on job status definition. As you mentioned, we could rewind a job (that reached the FINISHED state) to a previous checkpoint if retained on FINISHED status. However, the time of last checkpoint would not be so accurate, I don't know how much this could contribute and manual savepoint might be more useful in your scenario. > Allow external checkpoints to be persisted even when the job is in "Finished" > state. > > > Key: FLINK-18263 > URL: https://issues.apache.org/jira/browse/FLINK-18263 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Reporter: Mark Cho >Priority: Major > Labels: pull-request-available > > Currently, `execution.checkpointing.externalized-checkpoint-retention` > configuration supports two options: > - `DELETE_ON_CANCELLATION` which keeps the externalized checkpoints in FAILED > and SUSPENDED state. > - `RETAIN_ON_CANCELLATION` which keeps the externalized checkpoints in > FAILED, SUSPENDED, and CANCELED state. > This gives us control over the retention of externalized checkpoints in all > terminal state of a job, except for the FINISHED state. > If the job ends up in "FINISHED" state, externalized checkpoints will be > automatically cleaned up and there currently is no config that will ensure > that these externalized checkpoints to be persisted. > I found an old Jira ticket FLINK-4512 where this was discussed. I think it > would be helpful to have a config that can control the retention policy for > FINISHED state as well. 
> - This can be useful for cases where we want to rewind a job (that reached
> the FINISHED state) to a previous checkpoint.
> - When we use externalized checkpoints, we want to fully delegate the
> checkpoint clean-up to an external process in all job states (without
> cherry-picking the FINISHED state to be cleaned up by Flink).
> We have a quick fix working in our fork where we've changed the
> `ExternalizedCheckpointCleanup` enum:
> {code:java}
> RETAIN_ON_FAILURE (renamed from DELETE_ON_CANCELLATION; retains on FAILED)
> RETAIN_ON_CANCELLATION (kept the same; retains on FAILED, CANCELED)
> RETAIN_ON_SUCCESS (added; retains on FAILED, CANCELED, FINISHED)
> {code}
> Since this change requires changes to multiple components (e.g. config
> values, REST API, Web UI, etc.), I wanted to get the community's thoughts
> before I invest more time in my quick fix PR (which currently only contains
> a minimal change to get this working).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.
    [ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133651#comment-17133651 ]

Mark Cho commented on FLINK-18263:
----------------------------------

Added a PR for the proposed changes to the `ExternalizedCheckpointCleanup` enum and the `execution.checkpointing.externalized-checkpoint-retention` configuration.
The PR is WIP, as it still requires changes to multiple components once we align on the proposed changes.
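
To make the proposed semantics easier to discuss, here is a minimal, self-contained sketch of the three modes listed in the issue description. It is not the code from the PR and does not use Flink's classes: `TerminalState`, `ProposedCheckpointCleanup`, `retainsOn`, and `RetentionSketch` are hypothetical names introduced only for illustration. Only the three constant names and the states they retain on are taken from the listing in the description. SUSPENDED is left out because that listing does not say how the new modes would treat it.

```java
import java.util.EnumSet;
import java.util.Set;

/** Terminal states named in the proposal (hypothetical local enum, not Flink's JobStatus). */
enum TerminalState { FAILED, CANCELED, FINISHED }

/** Illustrative model of the proposed modes; only the constant names come from the issue. */
enum ProposedCheckpointCleanup {
    // Renamed from DELETE_ON_CANCELLATION in the proposal; retains on FAILED.
    RETAIN_ON_FAILURE(EnumSet.of(TerminalState.FAILED)),
    // Kept the same in the proposal; retains on FAILED and CANCELED.
    RETAIN_ON_CANCELLATION(EnumSet.of(TerminalState.FAILED, TerminalState.CANCELED)),
    // Added in the proposal; retains on FAILED, CANCELED, and FINISHED.
    RETAIN_ON_SUCCESS(EnumSet.of(
            TerminalState.FAILED, TerminalState.CANCELED, TerminalState.FINISHED));

    private final Set<TerminalState> retainedStates;

    ProposedCheckpointCleanup(Set<TerminalState> retainedStates) {
        this.retainedStates = retainedStates;
    }

    /** Hypothetical helper: true if the externalized checkpoint would be kept in this state. */
    boolean retainsOn(TerminalState state) {
        return retainedStates.contains(state);
    }
}

class RetentionSketch {
    public static void main(String[] args) {
        // Print, for every mode, whether a checkpoint would survive each terminal state.
        for (ProposedCheckpointCleanup mode : ProposedCheckpointCleanup.values()) {
            for (TerminalState state : TerminalState.values()) {
                System.out.println(mode + " / " + state + " -> retained=" + mode.retainsOn(state));
            }
        }
    }
}
```

In this sketch each mode is a strict superset of the previous one, which matches the ordering in the proposal: only `RETAIN_ON_SUCCESS` keeps the checkpoint when the job reaches FINISHED.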