Hi,

I see that your job mounts the host's /tmp/flink into the container and uses that path as the HA root path. Flink will create subdirectories under it to store the checkpoint and HA related data. This error usually means the corresponding HA directory could not be created, so I'd suggest checking the permissions on the /tmp/flink directory.
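One possible fix, as a sketch only (I haven't verified it against your exact setup): the official flink images run the main container as the flink user (uid 9999, if I remember correctly), so if /tmp/flink on the host is owned by root, the JobManager cannot create the ha/, checkpoints/ and savepoints/ subdirectories under the mount. You could add an initContainer to the podTemplate that hands the volume over to that user before Flink starts; the container name and the busybox image below are just placeholders:

  podTemplate:
    spec:
      initContainers:
        # Hypothetical fix-up step: runs as root before the Flink
        # containers start and chowns the hostPath mount to the flink
        # user (uid 9999 in the official images, an assumption here),
        # so Flink can create its ha/checkpoints/savepoints subdirs.
        - name: fix-flink-volume-permissions
          image: busybox:1.36
          securityContext:
            runAsUser: 0
          command: ["sh", "-c", "chown -R 9999:9999 /flink-data"]
          volumeMounts:
            - mountPath: /flink-data
              name: flink-volume

Changing the hostPath type from Directory to DirectoryOrCreate would also let Kubernetes create /tmp/flink if it does not exist yet, but the directory it creates is typically root-owned, so an ownership fix like the one above (or a matching chmod on the host) would still be needed.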
If that is indeed the problem, I think Flink should surface the directory-permission exception earlier, instead of failing later when it checks whether the path exists.

Best,
Weihua

On Mon, Jan 9, 2023 at 5:17 PM 圣 万 <sevev...@live.com> wrote:
> Hello,
>
> I have recently been trying to deploy Flink with flink-kubernetes-operator. I found some examples in the official GitHub project, and I ran into an error while deploying one of them. I would appreciate your help, thanks!
>
> Project: flink-kubernetes-operator/examples at main · apache/flink-kubernetes-operator (github.com)
> <https://github.com/apache/flink-kubernetes-operator/tree/main/examples>
> Example used: basic-checkpoint-ha.yaml
> <https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml>
>
> Its content is as follows:
>
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: basic-checkpoint-ha-example
> spec:
>   image: flink:1.15
>   flinkVersion: v1_15
>   flinkConfiguration:
>     taskmanager.numberOfTaskSlots: "2"
>     state.savepoints.dir: file:///flink-data/savepoints
>     state.checkpoints.dir: file:///flink-data/checkpoints
>     high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>     high-availability.storageDir: file:///flink-data/ha
>   serviceAccount: flink
>   jobManager:
>     resource:
>       memory: "2048m"
>       cpu: 1
>   taskManager:
>     resource:
>       memory: "2048m"
>       cpu: 1
>   podTemplate:
>     spec:
>       containers:
>         - name: flink-main-container
>           volumeMounts:
>             - mountPath: /flink-data
>               name: flink-volume
>       volumes:
>         - name: flink-volume
>           hostPath:
>             # directory location on host
>             path: /tmp/flink
>             # this field is optional
>             type: Directory
>   job:
>     jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
>     parallelism: 2
>     upgradeMode: savepoint
>     state: running
>     savepointTriggerNonce: 0
>
> The error output is as follows:
>
> 2023-01-05 18:51:12,176 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2023-01-05 18:51:12,185 INFO  org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
> 2023-01-05 18:51:12,191 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_352]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) [?:1.8.0_352]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606) [?:1.8.0_352]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_352]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_352]
>     at java.lang.Thread.run(Thread.java:750) [?:1.8.0_352]
> Caused by: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
>     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-dist-1.16.0.jar:1.16.0]
>     at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:182) ~[flink-dist-1.16.0.jar:1.16.0]
>     at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[flink-dist-1.16.0.jar:1.16.0]
>     at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[flink-dist-1.16.0.jar:1.16.0]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194) ~[flink-dist-1.16.0.jar:1.16.0]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.16.0.jar:1.16.0]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[flink-dist-1.16.0.jar:1.16.0]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_352]
>     ... 3 more
> 2023-01-05 18:51:12,211 INFO  org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
> 2023-01-05 18:51:12,574 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Starting the resource manager.
> 2023-01-05 18:51:13,776 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Recovered 0 pods from previous attempts, current attempt id is 1.
> 2023-01-05 18:51:13,777 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
> 2023-01-05 18:51:13,898 WARN  akka.actor.CoordinatedShutdown [] - Could not addJvmShutdownHook, due to: Shutdown in progress
> 2023-01-05 18:51:13,898 WARN  akka.actor.CoordinatedShutdown [] - Could not addJvmShutdownHook, due to: Shutdown in progress
> 2023-01-05 18:51:13,999 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
> 2023-01-05 18:51:14,000 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
> 2023-01-05 18:51:14,075 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
> 2023-01-05 18:51:14,076 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
> 2023-01-05 18:51:14,105 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.