Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。

2020-11-23 文章 赵一旦
@Yun Tang,应该是这个问题。
请教下这几个参数具体含义。

backoff in milliseconds for partition requests of input channels
是什么逻辑,以及initial和max分别表达含义。


akka.ask.timeout这个参数相对明显,就是超时,这个以前也涉及过,在cancel/submit/savepoint等情况都可能导致集群slot陆续没掉,然后再陆续回来(pass环境,基本就是部分机器失联,然后重新连接的case)。



Yun Tang  于2020年11月23日周一 下午5:11写道:

> Hi
>
> 集群负载比较大的时候,下游一直收不到request的partition,就会导致PartitionNotFoundException,建议增大
> taskmanager.network.request-backoff.max [1][2] 以增大重试次数
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-request-backoff-max
> [2] https://juejin.cn/post/6844904185347964942#heading-8
>
>
> 祝好
> 唐云
> 
> From: 赵一旦 
> Sent: Monday, November 23, 2020 13:08
> To: user-zh@flink.apache.org 
> Subject: Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。
>
> 这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的?
> 这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。
>
> zhisheng  于2020年11月18日周三 下午10:51写道:
>
> > 是不是有 kafka 机器挂了?
> >
> > Best
> > zhisheng
> >
> > hailongwang <18868816...@163.com> 于2020年11月18日周三 下午5:56写道:
> >
> > > 感觉还有其它 root cause,可以看下还有其它日志不?
> > >
> > >
> > > Best,
> > > Hailong
> > >
> > > At 2020-11-18 15:52:57, "赵一旦"  wrote:
> > > >2020-11-18 16:51:37
> > > >org.apache.flink.runtime.io
> > .network.partition.PartitionNotFoundException:
> > > >Partition
> > > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2
> > > >not found.
> > > >at org.apache.flink.runtime.io.network.partition.consumer.
> > > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267)
> > > >at org.apache.flink.runtime.io.network.partition.consumer.
> > >
> > >
> >
> >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166)
> > > >at org.apache.flink.runtime.io.network.partition.consumer.
> > > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521)
> > > >at org.apache.flink.runtime.io.network.partition.consumer.
> > >
> > >
> >
> >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765
> > > >)
> > > >at
> > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture
> > > >.java:670)
> > > >at java.util.concurrent.CompletableFuture$UniAccept.tryFire(
> > > >CompletableFuture.java:646)
> > > >at java.util.concurrent.CompletableFuture$Completion.run(
> > > >CompletableFuture.java:456)
> > > >at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> > > >at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
> > > >ForkJoinExecutorConfigurator.scala:44)
> > > >at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > >at
> > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool
> > > >.java:1339)
> > > >at
> > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > >at
> > > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread
> > > >.java:107)
> > > >
> > > >
> > > >请问这是什么问题呢?
> > >
> >
>


Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。

2020-11-23 文章 Yun Tang
Hi

集群负载比较大的时候,下游一直收不到request的partition,就会导致PartitionNotFoundException,建议增大 
taskmanager.network.request-backoff.max [1][2] 以增大重试次数

[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-request-backoff-max
[2] https://juejin.cn/post/6844904185347964942#heading-8


祝好
唐云

From: 赵一旦 
Sent: Monday, November 23, 2020 13:08
To: user-zh@flink.apache.org 
Subject: Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。

这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的?
这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。

zhisheng  于2020年11月18日周三 下午10:51写道:

> 是不是有 kafka 机器挂了?
>
> Best
> zhisheng
>
> hailongwang <18868816...@163.com> 于2020年11月18日周三 下午5:56写道:
>
> > 感觉还有其它 root cause,可以看下还有其它日志不?
> >
> >
> > Best,
> > Hailong
> >
> > At 2020-11-18 15:52:57, "赵一旦"  wrote:
> > >2020-11-18 16:51:37
> > >org.apache.flink.runtime.io
> .network.partition.PartitionNotFoundException:
> > >Partition
> > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2
> > >not found.
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267)
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> >
> >
> >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166)
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521)
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> >
> >
> >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765
> > >)
> > >at
> java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture
> > >.java:670)
> > >at java.util.concurrent.CompletableFuture$UniAccept.tryFire(
> > >CompletableFuture.java:646)
> > >at java.util.concurrent.CompletableFuture$Completion.run(
> > >CompletableFuture.java:456)
> > >at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> > >at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
> > >ForkJoinExecutorConfigurator.scala:44)
> > >at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > >at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool
> > >.java:1339)
> > >at
> > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > >at
> > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread
> > >.java:107)
> > >
> > >
> > >请问这是什么问题呢?
> >
>


Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。

2020-11-22 文章 赵一旦
这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的?
这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。

zhisheng  于2020年11月18日周三 下午10:51写道:

> 是不是有 kafka 机器挂了?
>
> Best
> zhisheng
>
> hailongwang <18868816...@163.com> 于2020年11月18日周三 下午5:56写道:
>
> > 感觉还有其它 root cause,可以看下还有其它日志不?
> >
> >
> > Best,
> > Hailong
> >
> > At 2020-11-18 15:52:57, "赵一旦"  wrote:
> > >2020-11-18 16:51:37
> > >org.apache.flink.runtime.io
> .network.partition.PartitionNotFoundException:
> > >Partition
> > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2
> > >not found.
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267)
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> >
> >
> >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166)
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521)
> > >at org.apache.flink.runtime.io.network.partition.consumer.
> >
> >
> >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765
> > >)
> > >at
> java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture
> > >.java:670)
> > >at java.util.concurrent.CompletableFuture$UniAccept.tryFire(
> > >CompletableFuture.java:646)
> > >at java.util.concurrent.CompletableFuture$Completion.run(
> > >CompletableFuture.java:456)
> > >at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> > >at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
> > >ForkJoinExecutorConfigurator.scala:44)
> > >at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > >at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool
> > >.java:1339)
> > >at
> > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > >at
> > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread
> > >.java:107)
> > >
> > >
> > >请问这是什么问题呢?
> >
>


Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。

2020-11-18 文章 zhisheng
是不是有 kafka 机器挂了?

Best
zhisheng

hailongwang <18868816...@163.com> 于2020年11月18日周三 下午5:56写道:

> 感觉还有其它 root cause,可以看下还有其它日志不?
>
>
> Best,
> Hailong
>
> At 2020-11-18 15:52:57, "赵一旦"  wrote:
> >2020-11-18 16:51:37
> >org.apache.flink.runtime.io.network.partition.PartitionNotFoundException:
> >Partition
> b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2
> >not found.
> >at org.apache.flink.runtime.io.network.partition.consumer.
> >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267)
> >at org.apache.flink.runtime.io.network.partition.consumer.
>
> >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166)
> >at org.apache.flink.runtime.io.network.partition.consumer.
> >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521)
> >at org.apache.flink.runtime.io.network.partition.consumer.
>
> >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765
> >)
> >at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture
> >.java:670)
> >at java.util.concurrent.CompletableFuture$UniAccept.tryFire(
> >CompletableFuture.java:646)
> >at java.util.concurrent.CompletableFuture$Completion.run(
> >CompletableFuture.java:456)
> >at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> >at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
> >ForkJoinExecutorConfigurator.scala:44)
> >at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool
> >.java:1339)
> >at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread
> >.java:107)
> >
> >
> >请问这是什么问题呢?
>


Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。

2020-11-18 文章 赵一旦
2020-11-18 16:51:37
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException:
Partition b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2
not found.
at org.apache.flink.runtime.io.network.partition.consumer.
RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267)
at org.apache.flink.runtime.io.network.partition.consumer.
RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166)
at org.apache.flink.runtime.io.network.partition.consumer.
SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521)
at org.apache.flink.runtime.io.network.partition.consumer.
SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765
)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture
.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(
CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture$Completion.run(
CompletableFuture.java:456)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool
.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread
.java:107)


请问这是什么问题呢?