Hi Li,

The missing file is a serialized job graph, and job recovery can't proceed without it. Unfortunately, the cluster as a whole can't come up if one of its jobs can't be recovered.
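If you need to unblock the cluster in the meantime, the exception's own hint ("Try cleaning the state handle store") amounts to removing the stale job graph entry from ZooKeeper so the Dispatcher stops trying to recover that job. Below is a minimal sketch using Apache Curator (the ZooKeeper client library Flink uses internally); the connection string is a placeholder, and the znode path assumes the defaults high-availability.zookeeper.path.root: /flink, high-availability.cluster-id: /default and high-availability.zookeeper.path.jobgraphs: /jobgraphs, so adjust both to your flink-conf.yaml. Note that this discards the recovery metadata for that job:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class CleanStaleJobGraph {
        public static void main(String[] args) throws Exception {
            // Placeholders: use your high-availability.zookeeper.quorum and the
            // job id from the exception message.
            String zkQuorum = "zookeeper-host:2181";
            String staleJobId = "00000000000000000000000000000000";
            // Assumed default layout: <path.root>/<cluster-id>/jobgraphs/<job-id>
            String jobGraphPath = "/flink/default/jobgraphs/" + staleJobId;

            try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                    zkQuorum, new ExponentialBackoffRetry(1000, 3))) {
                client.start();
                if (client.checkExists().forPath(jobGraphPath) != null) {
                    // Remove the job graph handle and any child nodes (e.g. lock
                    // nodes) so recovery no longer references the missing S3 file.
                    client.delete().deletingChildrenIfNeeded().forPath(jobGraphPath);
                }
            }
        }
    }

Depending on your ZooKeeper version, the equivalent in zkCli.sh is deleteall (or rmr on older versions) on the same path. It's safest to do this while the JobManagers are down, and be aware it only skips the broken recovery; the job itself has to be resubmitted.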
Regards,
Roman

On Thu, Jun 10, 2021 at 6:02 AM Li Peng <li.p...@doordash.com> wrote:
>
> Hey folks, we have a cluster with HA mode enabled, and recently, after a
> ZooKeeper restart, our Flink cluster (v. 1.11.3, Scala 2.12) crashed and got
> stuck in a crash loop, with the following error:
>
> 2021-06-10 02:14:52.123 [cluster-io-thread-1] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 00000000000000000000000000000000.
>     at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>     at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>     at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 00000000000000000000000000000000.
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:149)
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:125)
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:200)
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:115)
>     at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>     ... 3 common frames omitted
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /00000000000000000000000000000000. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStore.recoverJobGraph(ZooKeeperJobGraphStore.java:192)
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:146)
>     ... 7 common frames omitted
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://myfolder/recovery/myservice/2021-05-25T04:36:33Z/submittedJobGraph06ea8512c493
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:120)
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37)
>     at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:127)
>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:65)
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58)
>     at org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStore.recoverJobGraph(ZooKeeperJobGraphStore.java:186)
>     ... 8 common frames omitted
>
> We have an idea of why the file might be gone and are addressing it, but my
> question is: how can I configure things so that a missing job graph file
> doesn't trap the cluster in an endless restart loop? Is there some setting to
> treat this like a completely fresh deployment if the recovery file is missing?
>
> Thanks!
> Li