It will help a lot if you could share the logs of JobManager and TaskManager for the unexpected `SUSPENDED` job.
Best, Yang Xiaolong Wang <xiaolong.w...@smartnews.com> 于2022年5月16日周一 13:30写道: > Sorry for the late reply. > > I checked the logs in both jobmanager & taskmanager. > > During that time, there were no more logs there. > > How can I reproduce the issue ? > > On Thu, May 12, 2022 at 10:35 AM Yang Wang <danrtsey...@gmail.com> wrote: > >> The SUSPENDED state is usually caused by lost leadership. Maybe you could >> find more information about leader in the JobManager and TaskManager logs. >> >> Best, >> Yang >> >> Xiaolong Wang <xiaolong.w...@smartnews.com> 于2022年5月11日周三 19:18写道: >> >>> Hello, >>> >>> Recently our Flink jobs on Native K8s encountered failing in the >>> `SUSPENDED` status and got restarted for no reason. >>> >>> Flink version: 1.13.2 >>> >>> Logs: >>> ``` >>> 2022-05-11 05:01:41 >>> >>> 2022-05-10 21:01:41,771 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering >>> checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job >>> 00000000000000000000000000000000.\n >>> 2022-05-11 05:01:43 >>> >>> 2022-05-10 21:01:42,860 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed >>> checkpoint 17921 for job 00000000000000000000000000000000 (11840 bytes in >>> 866 ms).\n >>> 2022-05-11 05:04:34 >>> >>> 2022-05-10 21:04:34,550 INFO >>> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a >>> new watch on TaskManager pods.\n >>> 2022-05-11 05:06:43 >>> >>> 2022-05-10 21:06:43,512 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering >>> checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job >>> 00000000000000000000000000000000.\n >>> 2022-05-11 05:06:44 >>> >>> 2022-05-10 21:06:44,441 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed >>> checkpoint 17922 for job 00000000000000000000000000000000 (11840 bytes in >>> 977 ms).\n >>> 2022-05-11 05:11:45 >>> >>> 2022-05-10 21:11:44,826 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering >>> checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job >>> 00000000000000000000000000000000.\n >>> 2022-05-11 05:11:45 >>> >>> 2022-05-10 21:11:45,537 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed >>> checkpoint 17923 for job 00000000000000000000000000000000 (11840 bytes in >>> 646 ms).\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,746 INFO >>> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess >>> [] - Stopping SessionDispatcherLeaderProcess.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,747 INFO >>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping >>> dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,747 INFO >>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all >>> currently running jobs of dispatcher akka.tcp:// >>> flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,749 INFO >>> org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster >>> for job >>> insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink(00000000000000000000000000000000).\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,752 INFO >>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job >>> 00000000000000000000000000000000 reached terminal state SUSPENDED.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,752 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job >>> insert-xxx_sink (00000000000000000000000000000000) switched from state >>> RUNNING to SUSPENDED.\n >>> 2022-05-11 05:12:36 >>> >>> org.apache.flink.util.FlinkException: Scheduler is being stopped.\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> [flink-dist_2.11-1.13.2.jar:1.13.2]\n" >>> ... >>> ``` >>> >>> What does the state `SUSPENDED` mean ? And what may possibly cause this >>> issue ? >>> >>> Moreover, I described the jobmanager pod, and got this: >>> ``` >>> >>> Last State: Terminated >>> >>> Reason: Error >>> >>> Exit Code: 239 >>> >>> Started: Tue, 10 May 2022 22:59:42 +0800 >>> >>> Finished: Wed, 11 May 2022 05:12:42 +0800 >>> >>> ``` >>> >>> >>> Here, what does `Exit Code: 239` mean ? >>> >>> Thanks in advanced, >>> >>> Yours. >>> >>