[Problem description]
After enabling the HA configuration, the taskmanager pods keep cycling through create, stop, create, and the job cannot start.

1. Job configuration and startup procedure

a) Edit the conf/flink.yaml configuration file and add the HA settings:
kubernetes.cluster-id: realtime-monitor
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///opt/flink/checkpoint/recovery/monitor
(The storageDir is an NFS path, mounted into the pods through a PVC.)
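Since the storage directory is a file:// path on an NFS mount, one check that can be run inside a pod is whether that directory exists and is writable there. This is a minimal sketch, assuming a POSIX shell is available in the container. The function name check_storage_dir is made up for illustration and is not part of Flink.

```shell
# Hypothetical helper (not part of Flink): strip the file:// scheme from a
# high-availability.storageDir value and report whether the resulting
# directory exists and is writable by the current user.
check_storage_dir() {
  uri=$1
  path=${uri#file://}   # file:///opt/... becomes /opt/...
  if [ -d "$path" ] && [ -w "$path" ]; then
    echo "writable: $path"
  else
    echo "missing or not writable: $path"
    return 1
  fi
}
```

For example, `check_storage_dir file:///opt/flink/checkpoint/recovery/monitor` could be run in both a jobmanager and a taskmanager container to see whether the PVC is mounted the same way in each.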

b) First create a stateless deployment with the following command to bring up a session cluster:

./bin/kubernetes-session.sh \
-Dkubernetes.secrets=cdn-res-bd-keystore:/opt/flink/kafka/res/keystore/bd,cdn-res-bd-truststore:/opt/flink/kafka/res/truststore/bd,cdn-res-bj-keystore://opt/flink/kafka/res/keystore/bj,cdn-res-bj-truststore:/opt/flink/kafka/res/truststore/bj \
-Dkubernetes.pod-template-file=./conf/pod-template.yaml \
-Dkubernetes.cluster-id=realtime-monitor \
-Dkubernetes.jobmanager.service-account=wuzhiheng \
-Dkubernetes.namespace=monitor \
-Dtaskmanager.numberOfTaskSlots=6 \
-Dtaskmanager.memory.process.size=8192m \
-Djobmanager.memory.process.size=2048m
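The jobmanager log only reports exitCode=1 for the failed containers, so the actual error is most likely in the taskmanager pods' own logs. Those pods live for only a few seconds, so the logs have to be captured while the pods still exist. The following is a rough sketch, assuming kubectl is configured for this cluster. The namespace and pod-name prefix come from the command above, and the function names list_tm_pods and dump_tm_logs are invented for illustration.

```shell
NAMESPACE=monitor            # from -Dkubernetes.namespace
CLUSTER_ID=realtime-monitor  # from -Dkubernetes.cluster-id

# List the TaskManager pods of this cluster as "pod/<name>", one per line.
list_tm_pods() {
  kubectl -n "$NAMESPACE" get pods -o name \
    | grep "^pod/${CLUSTER_ID}-taskmanager-"
}

# Write the current log of each TaskManager pod to <pod-name>.log in the
# directory given as $1 (default: current directory).
dump_tm_logs() {
  out_dir=${1:-.}
  for pod in $(list_tm_pods); do
    kubectl -n "$NAMESPACE" logs "$pod" > "$out_dir/${pod#pod/}.log" 2>&1
  done
}
```

Because the pods are replaced every few seconds, it may help to call this in a loop while the job is being submitted, for example `mkdir -p /tmp/tm-logs; while true; do dump_tm_logs /tmp/tm-logs; sleep 1; done`.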

c) Finally, submit a jar job through the web UI. The jobmanager then logs the following:

2022-08-29 23:49:04,150 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod realtime-monitor-taskmanager-1-13 is created.
2022-08-29 23:49:04,152 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod realtime-monitor-taskmanager-1-12 is created.
2022-08-29 23:49:04,161 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-12
2022-08-29 23:49:04,162 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.
2022-08-29 23:49:04,162 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-13
2022-08-29 23:49:04,162 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.

2022-08-29 23:49:07,176 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Reaching max start worker failure rate: 12 events detected in the recent interval, reaching the threshold 10.000000.
2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Will not retry creating worker in 3000 ms.
2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.
2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-12 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)
2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}, current pending count: 2.
2022-08-29 23:49:07,514 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Reaching max start worker failure rate: 13 events detected in the recent interval, reaching the threshold 10.000000.
2022-08-29 23:49:07,514 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.
2022-08-29 23:49:07,514 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-13 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)
2022-08-29 23:49:07,515 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}, current pending count: 2.



2022-08-29 23:49:10,190 INFO  org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2022-08-29 23:49:10,192 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new TaskManager pod with name realtime-monitor-taskmanager-1-14 and resource <8192,6.0>.
2022-08-29 23:49:10,192 INFO  org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2022-08-29 23:49:10,194 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new TaskManager pod with name realtime-monitor-taskmanager-1-15 and resource <8192,6.0>.
2022-08-29 23:49:10,214 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod realtime-monitor-taskmanager-1-15 is created.
2022-08-29 23:49:10,215 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod realtime-monitor-taskmanager-1-14 is created.
2022-08-29 23:49:10,237 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-14
2022-08-29 23:49:10,238 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-14 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}
2022-08-29 23:49:10,238 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-15
2022-08-29 23:49:10,238 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-15 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.

2022-08-29 23:49:13,239 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Will not retry creating worker in 3000 ms.
2022-08-29 23:49:13,239 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-14 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.
2022-08-29 23:49:13,239 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-14 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)
2022-08-29 23:49:13,239 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}, current pending count: 2.
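To make the failure pattern in a long jobmanager log easier to see, the "is terminated" entries can be reduced to a pod name and exit code per line. This is a small sketch, assuming grep supports the -o option (GNU and BSD grep do). The function name summarize_terminated is made up for illustration.

```shell
# Print "<pod-name> <exitCode>" for every terminated TaskManager reported in
# a JobManager log read from stdin. All lines are joined first, because the
# "is terminated" message may be wrapped across several lines.
summarize_terminated() {
  tr '\n' ' ' \
    | grep -o 'Worker [^ ]* is terminated[^]]*exitCode=[0-9]*' \
    | sed 's/^Worker \([^ ]*\) is terminated.*exitCode=\([0-9]*\)$/\1 \2/'
}
```

Run as `summarize_terminated < jobmanager.log` (the file name is a placeholder). For the excerpt above, it should list realtime-monitor-taskmanager-1-12, 1-13 and 1-14, each with exit code 1.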

2. Without the HA configuration everything works. I tried both Flink 1.13.6 and 1.14.5, and the problem occurs on both.

3. The problem looks similar to: https://www.mail-archive.com/user-zh@flink.apache.org/msg11942.html

Could you tell me what might be going wrong here? Has anyone run into this before?
