[ 
https://issues.apache.org/jira/browse/FLINK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Cranmer resolved FLINK-32469.
-----------------------------------
    Resolution: Done

> Improve checkpoint REST APIs for programmatic access
> ----------------------------------------------------
>
>                 Key: FLINK-32469
>                 URL: https://issues.apache.org/jira/browse/FLINK-32469
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / REST
>    Affects Versions: 1.16.2, 1.17.1
>            Reporter: Hong Liang Teoh
>            Assignee: Hong Liang Teoh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.0
>
>
> *Why*
> We want to enable programmatic use of the checkpoints REST API, independent 
> of the Flink dashboard.
> Currently, REST APIs that retrieve information relating to a given Flink job 
> pass through the {{ExecutionGraphCache}}. This means that all these 
> APIs may return stale data, depending on the {{web.refresh-interval}}, 
> which defaults to 3s. For programmatic use of the REST API, we should be able 
> to retrieve either the latest or the cached version, depending on the client 
> (the Flink dashboard gets the cached version, other clients get the 
> up-to-date version).
> For example, a user might want to use the REST API to retrieve the latest 
> completed checkpoint for a given Flink job. This might be useful when trying 
> to use existing checkpoints as a state store when migrating a Flink job from 
> one cluster to another. See the example use case below.
> *What*
> This change moves the checkpoints REST APIs onto their own cache, separate 
> from the {{ExecutionGraphCache}}. This way, a user can set the timeout for 
> the checkpoints cache to 0s (disabling the cache) without much effect on the 
> user experience of the Flink dashboard.
> In addition, the checkpoint handlers currently first retrieve the 
> {{ExecutionGraph}}, then retrieve the {{CheckpointStatsSnapshot}} from 
> the graph. This is not needed, since the checkpoint handlers only need the 
> {{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only 
> the minimal required information ({{CheckpointStatsSnapshot}}) to 
> construct a reply.
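To illustrate why a separate cache with its own timeout helps, here is a minimal sketch of a time-based cache in Python. It is not Flink's implementation: the class name TimedCache, its parameters, and the loader callback are all hypothetical, chosen only to show that a timeout of 0 makes every lookup reload fresh data while a 3s timeout can serve stale entries.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class TimedCache(Generic[K, V]):
    """Hypothetical time-based cache; an illustration, not Flink's code."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        # Maps each key to (time the value was loaded, value).
        self._entries: Dict[K, Tuple[float, V]] = {}

    def get(self, key: K, loader: Callable[[], V]) -> V:
        """Return the cached value if still fresh, otherwise reload it."""
        now = self._clock()
        entry = self._entries.get(key)
        # With a timeout of 0 no entry is ever fresh, so every call reloads.
        if entry is not None and now - entry[0] < self._timeout:
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value
```

Under these assumptions, a dashboard-facing cache could be built as TimedCache(3.0) and a checkpoints cache as TimedCache(0.0), so that programmatic clients always trigger a fresh load without changing what the dashboard sees.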
>  
> *Example use case*
> When performing security patching / maintenance of the infrastructure 
> supporting the Flink cluster, we might want to transfer a given Flink job to 
> another cluster, whilst maintaining state. We can do this via the below steps:
>  # Old cluster - Select completed checkpoint on existing Flink job
>  # Old cluster - Stop the existing Flink job
>  # New cluster - Start a new Flink job with selected checkpoint
> Step 1 requires us to query the checkpoints REST API for the latest completed 
> checkpoint. With the status quo, we need to wait 3s (or whatever the 
> {{ExecutionGraphCache}} expiry may be) before the API reflects the latest 
> checkpoint. This is undesirable: if we do not wait, the new Flink job may 
> start from an older checkpoint and have to reprocess up to 3s of data (or 
> whatever the execution graph cache timeout is).
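Step 1 above could be scripted roughly as follows. This is a sketch under assumptions: it queries the /jobs/<jobid>/checkpoints endpoint and reads the "latest" and "completed" fields of the JSON reply, which is my understanding of the response layout and should be checked against the REST API documentation for the Flink version in use. The function names, base URL, and job ID are placeholders.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def latest_completed_checkpoint(stats: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Extract the latest completed checkpoint from a checkpoint stats reply.

    Assumes the reply carries it under stats["latest"]["completed"];
    returns None if that entry is missing or empty.
    """
    latest = stats.get("latest") or {}
    completed = latest.get("completed")
    return completed or None


def fetch_checkpoint_stats(
    base_url: str, job_id: str, timeout: float = 10.0
) -> Dict[str, Any]:
    """GET the checkpoint statistics for one job from the REST endpoint."""
    url = f"{base_url.rstrip('/')}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A caller on the old cluster might then run latest_completed_checkpoint(fetch_checkpoint_stats("http://localhost:8081", job_id)) and, if the result is not None, pass its checkpoint location (assumed here to be the "external_path" field) to the new cluster when starting the job.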



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
