tm挂掉了,可以看下是否存在checkpoint连续失败导致OOM, 或者是大数据集大窗口运算,如果数据量大也会导致这个问题。

Xintong Song <tonysong...@gmail.com> 于2019年12月25日周三 上午10:28写道:

> 这个应该不是root cause,slot was removed通常是tm挂掉了导致的,需要找下对应的tm日志看下挂掉的原因。
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Dec 24, 2019 at 10:06 PM hiliuxg <736742...@qq.com> wrote:
>
> > 偶尔发现,分配好的slot突然就被remove了,导致作业重启,看不出是什么原因导致?CPU和FULL GC都没有,异常信息如下:
> >
> > org.apache.flink.util.FlinkException: The assigned slot
> > bae00218c818157649eb9e3c533b86af_11 was removed.
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > akka.actor.ActorCell.invoke(ActorCell.scala:495)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> akka.dispatch.Mailbox.run(Mailbox.scala:224)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > &nbsp; &nbsp; &nbsp; &nbsp; at
> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>

回复