Hi, this is a timeout on a message sent from the TM to the JM. You can check whether the JM has any error logs, or whether the corresponding TM or JM has its resources (CPU, memory, network) saturated, which would cause the akka message to time out.
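If the timeout turns out to be caused by transient load rather than a real failure, one common mitigation (a suggestion, not a confirmed fix for this job) is to give the RPC ask a longer timeout in flink-conf.yaml; the stack trace below shows the Flink 1.13 default of 10000 ms firing. The values here are example values only:

```yaml
# flink-conf.yaml -- example values, tune for your cluster.
# Default in Flink 1.13 is "10 s"; the AskTimeoutException below fired after 10000 ms.
akka.ask.timeout: 60 s
# Optionally give JM/TM heartbeats more headroom as well (default 50000 ms).
heartbeat.timeout: 120000
```

Raising the timeout only hides the symptom if the JM is genuinely stuck, so it is worth checking the JM logs first.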
Best,
Shammon FY

On Sun, Apr 23, 2023 at 2:28 PM crazy <2463829...@qq.com.invalid> wrote:

> Hi,
> I have a Flink on Yarn job running flink-1.13.5 with the rocksdb
> statebackend. After the job runs for a while, the following stack
> trace appears:
>
> 2023-04-20 22:32:08,127 INFO  org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally Source: slow_source_name_3 (15/100)#0 (39e2f191bef3484c9b629258eb5afb87).
> 2023-04-20 22:32:08,146 WARN  org.apache.flink.runtime.taskmanager.Task [] - Source: slow_source_name_3 (15/100)#0 (39e2f191bef3484c9b629258eb5afb87) switched from RUNNING to FAILED with failure cause: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterOperatorEventGateway.sendOperatorEventToCoordinator(org.apache.flink.runtime.executiongraph.ExecutionAttemptID,org.apache.flink.runtime.jobgraph.OperatorID,org.apache.flink.util.SerializedValue) timed out.
>     at com.sun.proxy.$Proxy28.sendOperatorEventToCoordinator(Unknown Source)
>     at org.apache.flink.runtime.taskexecutor.rpc.RpcTaskOperatorEventGateway.sendOperatorEventToCoordinator(RpcTaskOperatorEventGateway.java:58)
>     at org.apache.flink.streaming.runtime.tasks.OperatorEventDispatcherImpl$OperatorEventGatewayImpl.sendEventToCoordinator(OperatorEventDispatcherImpl.java:114)
>     at org.apache.flink.streaming.api.operators.SourceOperator.emitLatestWatermark(SourceOperator.java:393)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:1425)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$16(StreamTask.java:1416)
>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:344)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@xxxxxxxxxx:40381/user/rpc/jobmanager_2#-1762983006]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>     ... 1 more
>
> Looking at the log of the corresponding process:
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x0000000000000000, pid=551250, tid=0x00007fcb1e7fc700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # C  0x0000000000000000
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /data1/hadoop/local/usercache/xxx/appcache/application_1681367620163_1307/container_e28_1681367620163_1307_01_000232/hs_err_pid551250.log
> Compiled method (nm)  5494257 11953  n 0  org.rocksdb.RocksDB::iteratorCF (native)
>  total in heap  [0x00007fcb8335de10,0x00007fcb8335e168] = 856
>  relocation     [0x00007fcb8335df38,0x00007fcb8335df80] = 72
>  main code      [0x00007fcb8335df80,0x00007fcb8335e160] = 480
>  oops           [0x00007fcb8335e160,0x00007fcb8335e168] = 8
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
>
> What could be causing this?
>
> crazy
> 2463829...@qq.com
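As an aside, the crash report above notes that core dumps were disabled. If you want a core file to inspect the native RocksDB crash the next time it happens, the JVM's own suggestion can be applied in the shell that launches the process (a sketch; how you inject this into the YARN container launch environment depends on your cluster setup, and the hard limit may cap what the soft limit can be raised to):

```shell
# Allow unlimited-size core dumps in the current shell,
# then verify the new soft limit before launching the JVM.
ulimit -c unlimited
ulimit -c
```

The resulting core file, together with the hs_err_pid*.log already written by the JVM, is what you would attach when reporting the native crash.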