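Re: [flink native k8s] HA config: taskmanager pod keeps restarting

2022-08-31 Posted by Wu,Zhiheng
I cannot find any TM logs, because the pod dies before the TM has finished starting.
I will check whether that is the cause; so far I have indeed not set the -Dkubernetes.taskmanager.service-account option.
Should -Dkubernetes.taskmanager.service-account be added when starting the session cluster with ./bin/kubernetes-session.sh?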
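
One way to check this before restarting the session might be to ask the API server whether a given service account may read ConfigMaps. The sketch below is only an illustration, not something posted in the thread: `check_configmap_access` is a hypothetical helper, the verbs `get`, `list` and `watch` are an assumption about what leader retrieval needs, and `default` is assumed to be the account the TM pods fall back to when no TM service account is configured. It relies on `kubectl auth can-i` with `--as`, so the caller needs impersonation rights.

```shell
#!/bin/sh
# Namespace taken from the thread; override via the environment if needed.
NAMESPACE=${NAMESPACE:-monitor}

# Report whether the given service account may perform each verb on ConfigMaps.
# The verb list is an assumption, not taken from the Flink documentation.
check_configmap_access() {
  sa=$1
  for verb in get list watch; do
    # "kubectl auth can-i" prints yes/no and exits non-zero on "no".
    answer=$(kubectl auth can-i "$verb" configmaps \
      --as="system:serviceaccount:$NAMESPACE:$sa" -n "$NAMESPACE" 2>/dev/null) || true
    echo "$sa $verb configmaps: ${answer:-unknown}"
  done
}

# Only query the cluster when kubectl is actually installed.
if command -v kubectl >/dev/null 2>&1; then
  check_configmap_access default     # assumed fallback account for TM pods
  check_configmap_access wuzhiheng   # account already used for the JM in this thread
fi
```

If the first account reports `no` while the second reports `yes`, that would be consistent with the missing TM service account being the cause.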

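On 2022/8/31 at 4:10 PM, "Yang Wang" wrote:

My guess is that no service account is set for the TM, so the TM has no permission to read the leader from the K8s ConfigMap and therefore cannot register with the RM and JM.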

-Dkubernetes.taskmanager.service-account=wuzhiheng \


Best,
Yang

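Re: [flink native k8s] HA config: taskmanager pod keeps restarting

2022-08-31 Posted by Yang Wang
My guess is that no service account is set for the TM, so the TM has no permission to read the leader from the K8s ConfigMap and therefore cannot register with the RM and JM.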

-Dkubernetes.taskmanager.service-account=wuzhiheng \
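
For reference, here is a sketch of how the session start command from the original post might look with this option added. The values are copied from the thread, the `-Dkubernetes.secrets` option is left out for brevity, and `session_args` and the `RUN` switch are hypothetical helpers introduced only for this illustration. By default the script prints the command instead of running it.

```shell
#!/bin/sh
# Values copied from the thread; adjust them for your own cluster.
NAMESPACE=monitor
SERVICE_ACCOUNT=wuzhiheng

# Print one -D option per line so the list can be inspected before use.
session_args() {
  cat <<EOF
-Dkubernetes.cluster-id=realtime-monitor
-Dkubernetes.namespace=$NAMESPACE
-Dkubernetes.jobmanager.service-account=$SERVICE_ACCOUNT
-Dkubernetes.taskmanager.service-account=$SERVICE_ACCOUNT
-Dkubernetes.pod-template-file=./conf/pod-template.yaml
-Dtaskmanager.numberOfTaskSlots=6
-Dtaskmanager.memory.process.size=8192m
-Djobmanager.memory.process.size=2048m
EOF
}

if [ "${RUN:-0}" = "1" ]; then
  # Word splitting is intended: none of the options contain spaces.
  ./bin/kubernetes-session.sh $(session_args)
else
  # Dry run: show the command that would be executed.
  echo "./bin/kubernetes-session.sh $(session_args | tr '\n' ' ')"
fi
```

Run it with `RUN=1` from the Flink distribution directory to actually start the session.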


Best,
Yang

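Re: [flink native k8s] HA config: taskmanager pod keeps restarting

2022-08-30 Posted by Xuyang
Hi, could you paste the TM logs? Judging from the WARN entries, the TM apparently never manages to start.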
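
Since the failed TM pods in the log below seem to disappear within a few seconds, their logs may have to be captured while the pods still exist. A minimal sketch, assuming the pods carry the labels `app=<cluster-id>` and `component=taskmanager`; `capture_tm_logs_once`, `OUT_DIR` and `WATCH` are hypothetical names introduced for this illustration, and the namespace and cluster id are taken from the quoted post.

```shell
#!/bin/sh
NAMESPACE=${NAMESPACE:-monitor}
CLUSTER_ID=${CLUSTER_ID:-realtime-monitor}
OUT_DIR=${OUT_DIR:-./tm-logs}

# Save the current log of every TaskManager pod of the cluster, once.
capture_tm_logs_once() {
  mkdir -p "$OUT_DIR"
  kubectl -n "$NAMESPACE" get pods \
    -l "app=$CLUSTER_ID,component=taskmanager" -o name |
  while read -r pod; do
    # "pod/realtime-monitor-taskmanager-1-12" -> "realtime-monitor-taskmanager-1-12"
    name=${pod#pod/}
    # Errors (for example, container not started yet) end up in the same file.
    kubectl -n "$NAMESPACE" logs "$pod" > "$OUT_DIR/$name.log" 2>&1 || true
  done
}

# Poll once per second only when invoked with WATCH=1; stop with Ctrl-C.
if [ "${WATCH:-0}" = "1" ]; then
  while true; do
    capture_tm_logs_once
    sleep 1
  done
fi
```

Start it with `WATCH=1` shortly before submitting the job; each pod's most recently captured output stays in `OUT_DIR` after the pod is gone.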
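At 2022-08-30 03:45:43, "Wu,Zhiheng" wrote:
>[Problem description]
>After enabling the HA configuration, the taskmanager pods keep cycling through create-stop-create, and the job cannot start.
>
>1. Job configuration and startup procedure
>
>a)  Edit the conf/flink.yaml configuration file and add the HA settings
>kubernetes.cluster-id: realtime-monitor
>high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>high-availability.storageDir: file:///opt/flink/checkpoint/recovery/monitor
>// This is an NFS path, mounted into the pod through a PVC
>
>b)  First create a stateless deployment with the following command to set up a session cluster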
>
>./bin/kubernetes-session.sh \
>-Dkubernetes.secrets=cdn-res-bd-keystore:/opt/flink/kafka/res/keystore/bd,cdn-res-bd-truststore:/opt/flink/kafka/res/truststore/bd,cdn-res-bj-keystore://opt/flink/kafka/res/keystore/bj,cdn-res-bj-truststore:/opt/flink/kafka/res/truststore/bj \
>-Dkubernetes.pod-template-file=./conf/pod-template.yaml \
>-Dkubernetes.cluster-id=realtime-monitor \
>-Dkubernetes.jobmanager.service-account=wuzhiheng \
>-Dkubernetes.namespace=monitor \
>-Dtaskmanager.numberOfTaskSlots=6 \
>-Dtaskmanager.memory.process.size=8192m \
>-Djobmanager.memory.process.size=2048m
>
>c)  Finally, submit a jar job through the web UI; the jobmanager then prints the following log
>
>2022-08-29 23:49:04,150 INFO  
>org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod 
>realtime-monitor-taskmanager-1-13 is created.
>
>2022-08-29 23:49:04,152 INFO  
>org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod 
>realtime-monitor-taskmanager-1-12 is created.
>
>2022-08-29 23:49:04,161 INFO  
>org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new 
>TaskManager pod: realtime-monitor-taskmanager-1-12
>
>2022-08-29 23:49:04,162 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Requested worker realtime-monitor-taskmanager-1-12 with resource spec 
>WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), 
>taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), 
>managedMemSize=0 bytes, numSlots=6}.
>
>2022-08-29 23:49:04,162 INFO  
>org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new 
>TaskManager pod: realtime-monitor-taskmanager-1-13
>
>2022-08-29 23:49:04,162 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Requested worker realtime-monitor-taskmanager-1-13 with resource spec 
>WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), 
>taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), 
>managedMemSize=0 bytes, numSlots=6}.
>
>2022-08-29 23:49:07,176 WARN  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Reaching max start worker failure rate: 12 events detected in the recent 
>interval, reaching the threshold 10.00.
>
>2022-08-29 23:49:07,176 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Will not retry creating worker in 3000 ms.
>
>2022-08-29 23:49:07,176 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec 
>{cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 
>bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, 
>numSlots=6} was requested in current attempt and has not registered. Current 
>pending count after removing: 1.
>
>2022-08-29 23:49:07,176 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Worker realtime-monitor-taskmanager-1-12 is terminated. Diagnostics: Pod 
>terminated, container termination statuses: [flink-main-container(exitCode=1, 
>reason=Error, message=null)], pod status: Failed(reason=null, message=null)
>
>2022-08-29 23:49:07,176 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0, 
>taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, 
>networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, 
>numSlots=6}, current pending count: 2.
>
>2022-08-29 23:49:07,514 WARN  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Reaching max start worker failure rate: 13 events detected in the recent 
>interval, reaching the threshold 10.00.
>
>2022-08-29 23:49:07,514 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec 
>{cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 
>bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, 
>numSlots=6} was requested in current attempt and has not registered. Current 
>pending count after removing: 1.
>
>2022-08-29 23:49:07,514 INFO  
>org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
>Worker realtime-monitor-taskmanager-1-13 is terminated. Diagnostics: Pod 
>terminated, container termination statuses: