Correction: this is the akka timeout exception.
java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.ApplicationExecutionException: Could not execute application.
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_282]
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:257) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$1(ApplicationDispatcherBootstrap.java:212) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
    at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:159) [flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) [flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.2.jar:1.12.2]
Caused by: org.apache.flink.client.deployment.application.ApplicationExecutionException: Could not execute application.
    ... 11 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.TimeoutException: Invocation of public default java.util.concurrent.CompletableFuture org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time) timed out.
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:366) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:242) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    ... 10 more
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Invocation of public default java.util.concurrent.CompletableFuture org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time) timed out.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_282]
    at org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:123) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:80) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.playgrounds.ops.clickcount.ClickEventCount.main(ClickEventCount.java:112) ~[?:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_282]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_282]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_282]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_282]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:349) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:242) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    ... 10 more
Caused by: java.util.concurrent.TimeoutException: Invocation of public default java.util.concurrent.CompletableFuture org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time) timed out.
    at org.apache.flink.runtime.rpc.akka.$Proxy36.requestJobStatus(Unknown Source) ~[?:1.12.2]
    at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$getJobResult$0(JobStatusPollingUtils.java:57) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.deployment.application.JobStatusPollingUtils.pollJobResultAsync(JobStatusPollingUtils.java:87) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$3(JobStatusPollingUtils.java:107) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    ... 9 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#1531007562]] after [60000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) ~[flink-dist_2.11-1.12.2.jar:1.12.2]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_282]

From: Chenyu Zheng <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 3, 2021 at 2:04 PM
To: "[email protected]" <[email protected]>
Subject: Several Flink 1.12.2 timeout issues

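Hello developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. During testing I found that once the job parallelism is increased, various timeouts occur from time to time. According to monitoring, neither the JM nor the TM containers use up the CPU and memory that k8s allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts became less frequent.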
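
For reference, the two settings mentioned above would look roughly like this in flink-conf.yaml. This is only a sketch: it assumes the standard configuration file is used (with native Kubernetes the same keys can also be passed as -D options at submission time), and the values are simply the ones quoted in this message, not a recommendation.

```yaml
# Timeout for akka ask calls (RPC futures); a duration string.
# 60 s matches the "after [60000 ms]" in the AskTimeoutException above.
akka.ask.timeout: 60 s

# Heartbeat timeout between JobManager/ResourceManager and TaskManagers,
# given in milliseconds (2 minutes here).
heartbeat.timeout: 120000
```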

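My questions: with a fairly large parallelism (for example 128), are these akka timeouts and heartbeat timeouts normal? If not, how should I troubleshoot the root cause? And what are the downsides of simply increasing the timeouts of the various components?

The attachment contains the JobManager log with the akka timeout; I will send the TaskManager heartbeat timeout log later.

Thanks!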
