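From this log alone I can't see any repeated failover. Which task do you mean by "the related tasks keep initializing"? I do see some akka connection errors, so the corresponding TaskManager may have exited abnormally. Could you confirm whether 192.168.10.227:35961 is a TaskManager address and, if it is, why it exited?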
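
In case it helps, here is a minimal sketch of both checks (searching for "to FAILED" and listing the endpoints that refused connections), assuming the JobManager log is available as plain text. The function name `summarize` and the regular expression are only illustrative; the two sample lines are copied from the log quoted in this thread.

```python
import re
from collections import Counter

# Matches "Connection refused: host/ip:port" as it appears in the quoted log.
REFUSED = re.compile(r"Connection refused: (\S+?)/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def summarize(log_text):
    """Return (lines containing 'to FAILED', Counter of refused endpoints)."""
    failed_lines = [line for line in log_text.splitlines() if "to FAILED" in line]
    refused = Counter(
        f"{host}/{ip}:{port}" for host, ip, port in REFUSED.findall(log_text)
    )
    return failed_lines, refused


# Two entries taken from the log in this thread, with wrapped lines rejoined.
sample = """\
2022-07-12 09:33:02,382 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
2022-07-12 09:33:02,514 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
"""

failed, refused = summarize(sample)
print("lines with 'to FAILED':", failed)
print("refused endpoints:", dict(refused))
```

An empty first result would match what I see here, that is, no failover recorded in this excerpt. The endpoint count only shows which address to look up among the YARN containers.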

Best,
Weihua

On Tue, Jul 12, 2022 at 9:37 AM ynz...@163.com <ynz...@163.com> wrote:

> Here are all of the JobManager logs:
> 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: execution.shutdown-on-attached-exit, false
> 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: pipeline.jars, file:/home/dataxc/opt/flink-1.14.4/opt/flink-python_2.11-1.14.4.jar
> 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: execution.checkpointing.min-pause, 8min
> 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: restart-strategy, failure-rate
> 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: jobmanager.memory.jvm-metaspace.size, 128m
> 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: state.checkpoints.dir, hdfs:///flink/checkpoints
> 2022-07-12 09:33:02,382 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> 2022-07-12 09:33:02,383 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> 2022-07-12 09:33:02,399 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at akka://flink/user/rpc/resourcemanager_1 .
> 2022-07-12 09:33:02,405 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Starting the resource manager.
> 2022-07-12 09:33:02,479 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing over to rm2
> 2022-07-12 09:33:02,509 INFO org.apache.flink.yarn.YarnResourceManagerDriver [] - Recovered 0 containers from previous attempts ([]).
> 2022-07-12 09:33:02,509 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
> 2022-07-12 09:33:02,514 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> 2022-07-12 09:33:02,515 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> 2022-07-12 09:33:02,528 INFO org.apache.hadoop.conf.Configuration [] - resource-types.xml not found
> 2022-07-12 09:33:02,528 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils [] - Unable to find 'resource-types.xml'.
> 2022-07-12 09:33:02,538 INFO org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
> 2022-07-12 09:33:02,541 INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Upper bound of the thread pool size is 500
> 2022-07-12 09:33:02,584 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> 2022-07-12 09:33:02,585 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
>
> best,
> ynz...@163.com
>
> From: Weihua Hu
> Date: 2022-07-11 19:46
> To: user-zh
> Subject: Re: flink-hudi-hive
>
> Hi,
> By "the tasks keep initializing", do you mean the job keeps failing over? The reason for a job failover is recorded in JobManager.log; search for the keyword "to FAILED".
>
> Best,
> Weihua
>
> On Mon, Jul 11, 2022 at 2:46 PM ynz...@163.com <ynz...@163.com> wrote:
>
> > Hi,
> > I am using Flink to write data to Hudi and sync it to Hive. After submitting the job to YARN, the Flink web UI shows the related tasks initializing over and over, and the task managers show no information. The logs contain no clear error message either.
> > When I remove the sync_hive-related configuration from the code and leave everything else unchanged, data is written to Hudi normally.
> > I am using hudi-0.11.1, flink-1.14.4, hadoop-3.3.1, hive-3.1.3.
> >
> > best,
> > ynz...@163.com