hello 这个报错看上去并不是状态不兼容的报错。 我看代码 Sink 算子设置了uid 理论上是可以正确恢复的。

kong <[email protected]> 于2021年10月21日周四 上午10:26写道:

> hi,我遇到flink修改sink并行度后,无法从checkpoint restore问题
>
>
> flink 版本: 1.13.1
> flink on yarn
> DataStream api方式写的java job
>
>
> 试验1:不修改任何代码,cancel job后,能从指定的checkpoint恢复
>        dataStream.addSink(new Sink(config)).name("xxxx").uid("xxxx");
> 试验2:只修改sink端的并行度,job无法启动,一直是Initiating状态
>     dataStream.addSink(new
> Sink(config)).name("xxxx").uid("xxxx").setParallelism(2);
>
>
> 日志异常
> 2021-10-21 09:52:57,076 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at
> akka://flink/user/rpc/resourcemanager_0 .
> 2021-10-21 09:52:57,132 INFO
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
> Starting DefaultLeaderElectionService with
> ZooKeeperLeaderElectionDriver{leaderPath='/leader/dispatcher_lock'}.
> 2021-10-21 09:52:57,133 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Starting the resource manager.
> 2021-10-21 09:52:57,134 INFO
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] -
> Starting DefaultLeaderRetrievalService with
> ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
> 2021-10-21 09:52:57,135 INFO
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] -
> Starting DefaultLeaderRetrievalService with
> ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/dispatcher_lock'}.
> 2021-10-21 09:52:57,142 INFO
> org.apache.flink.runtime.dispatcher.runner.JobDispatcherLeaderProcess [] -
> Start JobDispatcherLeaderProcess.
> 2021-10-21 09:52:57,151 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.MiniDispatcher at
> akka://flink/user/rpc/dispatcher_1 .
> 2021-10-21 09:52:57,228 INFO
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
> Starting DefaultLeaderElectionService with
> ZooKeeperLeaderElectionDriver{leaderPath='/leader/57645e97919d2efebfab67e2846696e7/job_manager_lock'}.
> 2021-10-21 09:52:57,312 WARN  akka.remote.ReliableDeliverySupervisor
>                  [] - Association with remote system
> [akka.tcp://[email protected]:42535] has failed, address is now gated
> for [50] ms. Reason: [Association failed with 
> [akka.tcp://[email protected]:42535]]
> Caused by: [java.net.ConnectException: Connection refused:
> node2.hadoop/xx.xx.xx.xx:42535]
> 2021-10-21 09:52:57,313 WARN  akka.remote.transport.netty.NettyTransport
>                  [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused:
> node2.hadoop/xx.xx.xx.xx:42535
> 2021-10-21 09:52:57,352 INFO
> org.apache.flink.yarn.YarnResourceManagerDriver              [] - Recovered
> 0 containers from previous attempts ([]).
> 2021-10-21 09:52:57,353 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Recovered 0 workers from previous attempt.
> 2021-10-21 09:52:57,379 INFO  org.apache.hadoop.conf.Configuration
>                  [] - resource-types.xml not found
> 2021-10-21 09:52:57,379 INFO
> org.apache.hadoop.yarn.util.resource.ResourceUtils           [] - Unable to
> find 'resource-types.xml'.
> 2021-10-21 09:52:57,394 INFO
> org.apache.flink.runtime.externalresource.ExternalResourceUtils [] -
> Enabled external resources: []
> 2021-10-21 09:52:57,395 WARN  akka.remote.transport.netty.NettyTransport
>                  [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused:
> node2.hadoop/xx.xx.xx.xx:42535
> 2021-10-21 09:52:57,396 WARN  akka.remote.ReliableDeliverySupervisor
>                  [] - Association with remote system
> [akka.tcp://[email protected]:42535] has failed, address is now gated
> for [50] ms. Reason: [Association failed with 
> [akka.tcp://[email protected]:42535]]
> Caused by: [java.net.ConnectException: Connection refused:
> node2.hadoop/xx.xx.xx.xx:42535]
> 2021-10-21 09:52:57,400 INFO
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Upper
> bound of the thread pool size is 500
> 2021-10-21 09:52:57,403 INFO
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
> Starting DefaultLeaderElectionService with
> ZooKeeperLeaderElectionDriver{leaderPath='/leader/resource_manager_lock'}.
> 2021-10-21 09:52:57,407 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> ResourceManager akka.tcp://[email protected]:43978/user/rpc/resourcemanager_0
> was granted leadership with fencing token ae27185bb3fda634d2c109510ab54ba4
>
>
>
>
>
>

回复