Hi,

I see that your job mounts the host directory /tmp/flink into the container and uses that path as the HA root path. Flink creates subdirectories under it to store the checkpoint and HA data. This error usually means the corresponding HA directory could not be created, so I'd suggest checking the permissions on /tmp/flink.
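If it does turn out to be a permission issue, one possible workaround (just a sketch, not verified against your setup; it assumes the official flink image runs the Flink process as UID/GID 9999 and that init containers in your cluster may run as root) is to prepare the mounted directory from the podTemplate before the Flink containers start, for example:

  podTemplate:
    spec:
      initContainers:
        # hypothetical init container that makes the hostPath writable for the flink user
        - name: fix-flink-data-permissions
          image: busybox:1.36
          command: ["sh", "-c", "mkdir -p /flink-data/ha /flink-data/checkpoints /flink-data/savepoints && chown -R 9999:9999 /flink-data"]
          volumeMounts:
            - mountPath: /flink-data
              name: flink-volume
      containers:
        - name: flink-main-container
          volumeMounts:
            - mountPath: /flink-data
              name: flink-volume
      volumes:
        - name: flink-volume
          hostPath:
            path: /tmp/flink
            # DirectoryOrCreate lets Kubernetes create the directory on the node if it is missing,
            # whereas "type: Directory" in the example requires it to already exist
            type: DirectoryOrCreate

Alternatively, simply making /tmp/flink on the node writable for that UID (or world-writable for a quick test) should have the same effect.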

If this is indeed the cause, I think Flink should surface the directory permission exception earlier, instead of only failing later when it checks whether the path exists.


Best,
Weihua


On Mon, Jan 9, 2023 at 5:17 PM 圣 万 <sevev...@live.com> wrote:

> Hello,
>
> I've recently been trying to deploy Flink with flink-kubernetes-operator. I found some examples in the official GitHub project, and I ran into an error while deploying one of them. Could you please help me take a look? Thanks!
> Project: flink-kubernetes-operator/examples at main · apache/flink-kubernetes-operator (github.com)
> <https://github.com/apache/flink-kubernetes-operator/tree/main/examples>
> Example used: basic-checkpoint-ha.yaml
> <https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml>
> Its content is as follows:
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: basic-checkpoint-ha-example
> spec:
>   image: flink:1.15
>   flinkVersion: v1_15
>   flinkConfiguration:
>     taskmanager.numberOfTaskSlots: "2"
>     state.savepoints.dir: file:///flink-data/savepoints
>     state.checkpoints.dir: file:///flink-data/checkpoints
>     high-availability:
> org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>     high-availability.storageDir: file:///flink-data/ha
>   serviceAccount: flink
>   jobManager:
>     resource:
>       memory: "2048m"
>       cpu: 1
>   taskManager:
>     resource:
>       memory: "2048m"
>       cpu: 1
>   podTemplate:
>     spec:
>       containers:
>         - name: flink-main-container
>           volumeMounts:
>           - mountPath: /flink-data
>             name: flink-volume
>       volumes:
>       - name: flink-volume
>         hostPath:
>           # directory location on host
>           path: /tmp/flink
>           # this field is optional
>           type: Directory
>   job:
>     jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
>     parallelism: 2
>     upgradeMode: savepoint
>     state: running
>     savepointTriggerNonce: 0
>
>
> The error output is as follows:
> 2023-01-05 18:51:12,176 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [] - Stopping SessionDispatcherLeaderProcess.
> 2023-01-05 18:51:12,185 INFO
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Stopping
> DefaultJobGraphStore.
> 2023-01-05 18:51:12,191 ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal
> error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: java.lang.IllegalStateException:
> The base directory of the JobResultStore isn't accessible. No dirty
> JobResults can be restored.
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
> ~[?:1.8.0_352]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
> [?:1.8.0_352]
>      at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
> [?:1.8.0_352]
>      at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_352]
>      at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_352]
>      at java.lang.Thread.run(Thread.java:750) [?:1.8.0_352]
> Caused by: java.lang.IllegalStateException: The base directory of the
> JobResultStore isn't accessible. No dirty JobResults can be restored.
>      at
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:182)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188)
> ~[flink-dist-1.16.0.jar:1.16.0]
>      at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> ~[?:1.8.0_352]
>      ... 3 more
> 2023-01-05 18:51:12,211 INFO  org.apache.flink.runtime.blob.BlobServer
>                  [] - Stopped BLOB server at 0.0.0.0:6124
> 2023-01-05 18:51:12,574 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Starting the resource manager.
> 2023-01-05 18:51:13,776 INFO
> org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Recovered
> 0 pods from previous attempts, current attempt id is 1.
> 2023-01-05 18:51:13,777 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Recovered 0 workers from previous attempt.
> 2023-01-05 18:51:13,898 WARN  akka.actor.CoordinatedShutdown
>                  [] - Could not addJvmShutdownHook, due to: Shutdown in
> progress
> 2023-01-05 18:51:13,898 WARN  akka.actor.CoordinatedShutdown
>                  [] - Could not addJvmShutdownHook, due to: Shutdown in
> progress
> 2023-01-05 18:51:13,999 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting
> down remote daemon.
> 2023-01-05 18:51:14,000 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting
> down remote daemon.
> 2023-01-05 18:51:14,075 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2023-01-05 18:51:14,076 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2023-01-05 18:51:14,105 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting
> shut down.
>
