Robert Metzger created FLINK-22483:
--------------------------------------

             Summary: Recover checkpoints when JobMaster gains leadership
                 Key: FLINK-22483
                 URL: https://issues.apache.org/jira/browse/FLINK-22483
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.13.0
            Reporter: Robert Metzger
             Fix For: 1.14.0


Recovering checkpoints (from the CompletedCheckpointStore) is a potentially 
blocking operation, for example if the file system implementation is retrying 
to connect to an unavailable storage backend.

Currently, we are calling the CompletedCheckpointStore.recover() method from 
the main thread of the JobManager, making it unresponsive to any RPC call while 
the recover method is blocked.

By moving the recovery to the start of the JobManager (which happens 
asynchronously after the JobMaster has gained leadership), Flink will remain 
responsive (reporting a job in INITIALIZING state).
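
A minimal sketch of the idea, not Flink's actual implementation: the blocking 
recover call runs on a separate I/O executor, so the thread that serves status 
queries is never blocked and keeps reporting INITIALIZING until recovery 
finishes. All names here (AsyncRecoverySketch, BlockingStore, Status, start) 
are hypothetical stand-ins for illustration; only the notion of a blocking 
recover() method comes from the description above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncRecoverySketch {

    /** Hypothetical stand-in for a checkpoint store whose recover() may block. */
    public interface BlockingStore {
        void recover() throws Exception;
    }

    /** Hypothetical, simplified job states. */
    public enum Status { INITIALIZING, RUNNING, FAILED }

    private final AtomicReference<Status> status =
            new AtomicReference<>(Status.INITIALIZING);

    /** Cheap, non-blocking status query (what an RPC handler would call). */
    public Status getStatus() {
        return status.get();
    }

    /**
     * Runs the potentially blocking recovery on ioExecutor instead of the
     * caller's thread. The returned future completes after the status has
     * been updated to RUNNING or FAILED.
     */
    public CompletableFuture<Void> start(BlockingStore store, Executor ioExecutor) {
        return CompletableFuture
                .runAsync(() -> {
                    try {
                        store.recover();
                    } catch (Exception e) {
                        throw new CompletionException(e);
                    }
                }, ioExecutor)
                .whenComplete((ignored, failure) ->
                        status.set(failure == null ? Status.RUNNING : Status.FAILED));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        // Simulates a storage backend that stays unavailable until released.
        CountDownLatch storageAvailable = new CountDownLatch(1);

        AsyncRecoverySketch job = new AsyncRecoverySketch();
        CompletableFuture<Void> started =
                job.start(() -> storageAvailable.await(), ioExecutor);

        // The calling thread is free to answer status queries meanwhile.
        System.out.println("while recovery is blocked: " + job.getStatus());

        storageAvailable.countDown();
        started.get(5, TimeUnit.SECONDS);
        System.out.println("after recovery: " + job.getStatus());

        ioExecutor.shutdown();
    }
}
```

The sketch uses a dedicated executor so that a stalled storage backend only 
ties up an I/O thread. A failure in recover() surfaces through the returned 
future rather than on the calling thread.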



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
