Hi

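This is a timeout on a message sent from the TM to the JM. Check whether the JM has any error logs, or whether the TM or JM involved has maxed out its resources or hit a similar condition that would cause the Akka message to time out.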
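
As an illustration only, and not something confirmed from the logs in this thread: the `10000 ms` in the quoted `AskTimeoutException` matches Flink's default `akka.ask.timeout` of 10 s. If the JM turns out to be slow to respond rather than failed, one possible stopgap is to raise that timeout in `flink-conf.yaml`. The 60 s value below is an assumed example, not a tuned recommendation, and it will not help if a process has actually crashed.

```yaml
# flink-conf.yaml (sketch; the 60 s value is an assumption for illustration)
# Timeout for Akka ask calls such as TM -> JM RPCs; the Flink 1.13 default is 10 s.
akka.ask.timeout: 60 s
```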

Best,
Shammon FY


On Sun, Apr 23, 2023 at 2:28 PM crazy <2463829...@qq.com.invalid> wrote:

> Hi all,
> I have a Flink on YARN job. It runs on flink-1.13.5 with RocksDB as the
> state backend. After running for a while, the job fails with the following
> stack trace:
>
>
> 2023-04-20 22:32:08,127 INFO  org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally Source: slow_source_name_3 (15/100)#0 (39e2f191bef3484c9b629258eb5afb87).
> 2023-04-20 22:32:08,146 WARN  org.apache.flink.runtime.taskmanager.Task [] - Source: slow_source_name_3 (15/100)#0 (39e2f191bef3484c9b629258eb5afb87) switched from RUNNING to FAILED with failure cause: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterOperatorEventGateway.sendOperatorEventToCoordinator(org.apache.flink.runtime.executiongraph.ExecutionAttemptID,org.apache.flink.runtime.jobgraph.OperatorID,org.apache.flink.util.SerializedValue) timed out.
>         at com.sun.proxy.$Proxy28.sendOperatorEventToCoordinator(Unknown Source)
>         at org.apache.flink.runtime.taskexecutor.rpc.RpcTaskOperatorEventGateway.sendOperatorEventToCoordinator(RpcTaskOperatorEventGateway.java:58)
>         at org.apache.flink.streaming.runtime.tasks.OperatorEventDispatcherImpl$OperatorEventGatewayImpl.sendEventToCoordinator(OperatorEventDispatcherImpl.java:114)
>         at org.apache.flink.streaming.api.operators.SourceOperator.emitLatestWatermark(SourceOperator.java:393)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:1425)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$16(StreamTask.java:1416)
>         at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>         at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
>         at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:344)
>         at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
>         at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
>         at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@xxxxxxxxxx:40381/user/rpc/jobmanager_2#-1762983006]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>         at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>         at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>         at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>         at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>         at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>         at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>         at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>         at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>         ... 1 more
>
> The log of the corresponding process shows:
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0000000000000000, pid=551250, tid=0x00007fcb1e7fc700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # C  0x0000000000000000
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /data1/hadoop/local/usercache/xxx/appcache/application_1681367620163_1307/container_e28_1681367620163_1307_01_000232/hs_err_pid551250.log
> Compiled method (nm) 5494257 11953     n 0       org.rocksdb.RocksDB::iteratorCF (native)
>  total in heap  [0x00007fcb8335de10,0x00007fcb8335e168] = 856
>  relocation     [0x00007fcb8335df38,0x00007fcb8335df80] = 72
>  main code      [0x00007fcb8335df80,0x00007fcb8335e160] = 480
>  oops           [0x00007fcb8335e160,0x00007fcb8335e168] = 8
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
>
>
> What could be causing this?
>
>
> crazy
> 2463829...@qq.com
