Feifan Wang created FLINK-24149:
-----------------------------------

             Summary: Make checkpoint relocatable
                 Key: FLINK-24149
                 URL: https://issues.apache.org/jira/browse/FLINK-24149
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
            Reporter: Feifan Wang


h3. Background

FLINK-5763 proposed making savepoints relocatable, and checkpoints have a 
similar requirement. For example, when migrating jobs to another HDFS cluster, 
a savepoint would work, but we prefer to use persistent checkpoints, especially 
because RocksDBStateBackend incremental checkpoints perform better than 
savepoints during both snapshot and restore.

 

FLINK-8531 standardized the checkpoint directory layout:

 
{code:java}
/user-defined-checkpoint-dir
    |
    + 1b080b6e710aabbef8993ab18c6de98b (job's ID)
        |
        + --shared/
        + --taskowned/
        + --chk-00001/
        + --chk-00002/
        + --chk-00003/
        ...
{code}
 * The state backend creates a subdirectory named after the job's ID that 
contains the actual checkpoints, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/
 * Each checkpoint stores its own files in a subdirectory that includes the 
checkpoint number, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/chk-00003/
 * Files shared between checkpoints are stored in the shared/ directory, which 
sits in the same parent directory as the individual checkpoint directories, 
such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/shared/
 * Similarly, files owned strictly by tasks are stored in the taskowned/ 
directory in the same parent directory, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/taskowned/

h3. Proposal

Since an individual checkpoint directory does not contain the complete state 
data, we cannot make it relocatable on its own, but its parent directory can 
be. The only work left is to make the metadata file reference state files by 
relative paths.

I propose making these changes to _*FsCheckpointStateOutputStream*_:
 * introduce a _*checkpointDirectory*_ field
 * introduce an *_entropyInjecting_* field
 * have *_closeAndGetHandle()_* return a _*RelativeFileStateHandle*_ whose path 
is relative to _*checkpointDirectory*_ (except on entropy-injecting file 
systems)
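
The idea behind the last item can be sketched as follows. This is only an 
illustration using java.net.URI, not the actual Flink change: the class 
RelativeHandleSketch, its two methods, the host names nn1/nn2 and the file 
names are all hypothetical, and it assumes the base directory is the job-level 
directory that contains shared/, taskowned/ and chk-*/.

```java
import java.net.URI;

// Hypothetical sketch, not Flink code: shows how a state file path could be
// stored relative to a base directory and resolved again after relocation.
class RelativeHandleSketch {

    // A trailing slash makes URI.resolve() append to the directory
    // instead of replacing its last path segment.
    private static URI asDirectory(String dir) {
        return URI.create(dir.endsWith("/") ? dir : dir + "/");
    }

    // Path to record in the metadata file. With entropy injection the file
    // may not live under the base directory, so the absolute path is kept.
    static String pathForMetadata(String baseDir, String stateFile, boolean entropyInjecting) {
        if (entropyInjecting) {
            return stateFile;
        }
        return asDirectory(baseDir).relativize(URI.create(stateFile)).toString();
    }

    // On restore, relative paths are resolved against the (possibly moved)
    // base directory; absolute paths are returned unchanged.
    static String resolveOnRestore(String newBaseDir, String storedPath) {
        URI stored = URI.create(storedPath);
        if (stored.isAbsolute()) {
            return storedPath;
        }
        return asDirectory(newBaseDir).resolve(stored).toString();
    }

    public static void main(String[] args) {
        String oldBase = "hdfs://nn1/ckpt/1b080b6e710aabbef8993ab18c6de98b";
        String newBase = "hdfs://nn2/backup/1b080b6e710aabbef8993ab18c6de98b";

        String stored = pathForMetadata(oldBase, oldBase + "/shared/sst-file-1", false);
        System.out.println(stored);
        System.out.println(resolveOnRestore(newBase, stored));
    }
}
```

With relative paths in the metadata, copying the whole job directory to another 
cluster keeps every reference valid, whereas absolute paths would still point 
at the old location.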

[~yunta], [~trohrmann], I have verified this in our environment and will submit 
a pull request to implement this feature. Please help evaluate whether it is 
appropriate.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
