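From this log alone I can't see any continuous failover. Which task do you mean by "the related tasks keep re-initializing"?
I do see some akka connection errors, which may mean the corresponding TM exited abnormally. Could you check whether 192.168.10.227:35961 is a
TaskManager address and, if so, why it exited?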
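
To make both checks quicker, here is a minimal Python sketch that scans a JobManager log for lines containing "to FAILED" and counts the endpoints that refused connections. It assumes each log entry sits on a single line in the actual file (unlike the wrapped copy quoted below). The function name `scan_jobmanager_log` and the file name passed on the command line are illustrative, not part of Flink.

```python
import re
import sys
from collections import Counter

# Keyword suggested earlier in this thread for finding failover causes.
FAILED_PATTERN = re.compile(r"\bto FAILED\b")
# Captures the ip:port from text such as
# "Connection refused: n103/192.168.10.227:35961".
REFUSED_PATTERN = re.compile(
    r"Connection refused: (?:\S+?/)?(\d{1,3}(?:\.\d{1,3}){3}:\d+)"
)


def scan_jobmanager_log(lines):
    """Return (lines containing 'to FAILED', Counter of refused endpoints)."""
    failed = []
    refused = Counter()
    for line in lines:
        if FAILED_PATTERN.search(line):
            failed.append(line.rstrip("\n"))
        for match in REFUSED_PATTERN.finditer(line):
            refused[match.group(1)] += 1
    return failed, refused


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an assumption): python scan_log.py jobmanager.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        failed_lines, refused_endpoints = scan_jobmanager_log(handle)
    print("Lines containing 'to FAILED':", len(failed_lines))
    for entry in failed_lines:
        print("  ", entry)
    print("Refused endpoints:")
    for endpoint, count in refused_endpoints.most_common():
        print("  ", endpoint, count)
```

If the first list is empty, as it appears to be for the log quoted below, the job is probably not failing over, and the refused endpoint is the next thing to look into.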

Best,
Weihua


On Tue, Jul 12, 2022 at 9:37 AM ynz...@163.com <ynz...@163.com> wrote:

> Here are all of the JobManager logs:
> 2022-07-12 09:33:02,280 INFO
> org.apache.flink.configuration.GlobalConfiguration           [] - Loading
> configuration property: execution.shutdown-on-attached-exit, false
> 2022-07-12 09:33:02,280 INFO
> org.apache.flink.configuration.GlobalConfiguration           [] - Loading
> configuration property: pipeline.jars,
> file:/home/dataxc/opt/flink-1.14.4/opt/flink-python_2.11-1.14.4.jar
> 2022-07-12 09:33:02,280 INFO
> org.apache.flink.configuration.GlobalConfiguration           [] - Loading
> configuration property: execution.checkpointing.min-pause, 8min
> 2022-07-12 09:33:02,280 INFO
> org.apache.flink.configuration.GlobalConfiguration           [] - Loading
> configuration property: restart-strategy, failure-rate
> 2022-07-12 09:33:02,280 INFO
> org.apache.flink.configuration.GlobalConfiguration           [] - Loading
> configuration property: jobmanager.memory.jvm-metaspace.size, 128m
> 2022-07-12 09:33:02,280 INFO
> org.apache.flink.configuration.GlobalConfiguration           [] - Loading
> configuration property: state.checkpoints.dir, hdfs:///flink/checkpoints
> 2022-07-12 09:33:02,382 WARN  akka.remote.transport.netty.NettyTransport
>                  [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> 2022-07-12 09:33:02,383 WARN  akka.remote.ReliableDeliverySupervisor
>                  [] - Association with remote system 
> [akka.tcp://flink@n103:35961]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@n103:35961]] Caused by:
> [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> 2022-07-12 09:33:02,399 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at
> akka://flink/user/rpc/resourcemanager_1 .
> 2022-07-12 09:33:02,405 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Starting the resource manager.
> 2022-07-12 09:33:02,479 INFO
> org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] -
> Failing over to rm2
> 2022-07-12 09:33:02,509 INFO
> org.apache.flink.yarn.YarnResourceManagerDriver              [] - Recovered
> 0 containers from previous attempts ([]).
> 2022-07-12 09:33:02,509 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Recovered 0 workers from previous attempt.
> 2022-07-12 09:33:02,514 WARN  akka.remote.transport.netty.NettyTransport
>                  [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> 2022-07-12 09:33:02,515 WARN  akka.remote.ReliableDeliverySupervisor
>                  [] - Association with remote system 
> [akka.tcp://flink@n103:35961]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@n103:35961]] Caused by:
> [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> 2022-07-12 09:33:02,528 INFO  org.apache.hadoop.conf.Configuration
>                  [] - resource-types.xml not found
> 2022-07-12 09:33:02,528 INFO
> org.apache.hadoop.yarn.util.resource.ResourceUtils           [] - Unable to
> find 'resource-types.xml'.
> 2022-07-12 09:33:02,538 INFO
> org.apache.flink.runtime.externalresource.ExternalResourceUtils [] -
> Enabled external resources: []
> 2022-07-12 09:33:02,541 INFO
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Upper
> bound of the thread pool size is 500
> 2022-07-12 09:33:02,584 WARN  akka.remote.transport.netty.NettyTransport
>                  [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> 2022-07-12 09:33:02,585 WARN  akka.remote.ReliableDeliverySupervisor
>                  [] - Association with remote system 
> [akka.tcp://flink@n103:35961]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@n103:35961]] Caused by:
> [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
>
>
>
> best,
> ynz...@163.com
>
> From: Weihua Hu
> Date: 2022-07-11 19:46
> To: user-zh
> Subject: Re: flink-hudi-hive
> Hi,
> Does "tasks keep re-initializing" mean the job keeps failing over? The JobManager log shows the reason for
> each failover; search for the keyword "to FAILED".
>
> Best,
> Weihua
>
>
> On Mon, Jul 11, 2022 at 2:46 PM ynz...@163.com <ynz...@163.com> wrote:
>
> > Hi,
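> >     I am using Flink to write data to Hudi and sync it to Hive. After submitting the job to YARN, the Flink web
> > UI shows the related tasks re-initializing repeatedly, and the Task Managers page shows no information. The logs contain no clear error message either.
> >     When I remove the sync_hive-related configuration from the code and leave all other settings unchanged, data is written to Hudi normally.
> >     I am using hudi-0.11.1, flink-1.14.4, hadoop-3.3.1, and hive-3.1.3.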
> >
> >
> >
> > best,
> > ynz...@163.com
> >
>
