Hi Xiangyu, thanks for reaching out to the community. Could you share the entire TaskManager and JobManager logs with us? That might help investigating what's going on.
Best, Matthias On Fri, Sep 3, 2021 at 10:07 AM Xiangyu Su <xian...@smaato.com> wrote: > Hi Yun, > Thanks alot. > I am running a test, and facing the "Job Leader lost leadership..." issue, > and also the checkpointing timeout at the same time,, not sure whether > those 2 things related to each other. > regarding your question: > 1. GC looks ok. > 2. seems like once the "Job Leader lost leadership..." happens flink job > can not successfully get restarted. > and e.g here is some logs from one job failure: > --------------- > 2021-09-02 20:41:11,345 WARN org.apache.flink.runtime.taskmanager.Task > [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18 > (9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with > failure cause: org.apache.flink.util.FlinkException: Disconnect from > JobManager responsible for ec6fd88643747aafac06ee906e421a96. > at > org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189) > at java.util.Optional.ifPresent(Optional.java:159) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at akka.actor.Actor$class.aroundReceive(Actor.scala:517) > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) > at akka.actor.ActorCell.invoke(ActorCell.scala:561) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) > at akka.dispatch.Mailbox.run(Mailbox.scala:225) > at akka.dispatch.Mailbox.exec(Mailbox.scala:235) > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: java.lang.Exception: Job leader for job id > ec6fd88643747aafac06ee906e421a96 lost leadership. > ... 24 more > > ----------- > 2021-09-02 20:47:22,388 ERROR > org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] - > Authentication failed > 2021-09-02 20:47:22,388 INFO > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/ > 10.168.175.10:2181 > 2021-09-02 20:47:22,388 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > SASL configuration failed: javax.security.auth.login.LoginException: No > JAAS configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.Mailbox.exec(Mailbox.scala:235) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.Mailbox.run(Mailbox.scala:225) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.ActorCell.invoke(ActorCell.scala:561) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.Actor$class.aroundReceive(Actor.scala:517) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_302] > at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?] > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution > attempt deb6e9dd535069eb66e2139fde5b77cd was not found. > 2021-09-02 20:47:21,870 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Cannot > find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with > exception: > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.Mailbox.exec(Mailbox.scala:235) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.Mailbox.run(Mailbox.scala:225) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.ActorCell.invoke(ActorCell.scala:561) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.actor.Actor$class.aroundReceive(Actor.scala:517) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > [flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_302] > at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?] > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441) > ~[flink-dist_2.11-1.13.2.jar:1.13.2] > > Thanks for your support. > Best Regards, > > On Thu, 2 Sept 2021 at 16:43, Yun Gao <yungao...@aliyun.com> wrote: > >> Hi Xiangyu, >> >> There might be different reasons for the "Job Leader... lost leadership" >> problem. Do you see the erros >> in the TM log ? If so, the root cause might be that the connection >> between the TM and ZK is lost or >> timeout. Have you checked the GC status of the TM side ? If the GC is ok, >> could you provide more detailed >> exception stack ? >> >> Best, >> Yun >> >> >> ------------------Original Mail ------------------ >> *Sender:*Xiangyu Su <xian...@smaato.com> >> *Send Date:*Wed Sep 1 15:31:03 2021 >> *Recipients:*user <user@flink.apache.org> >> *Subject:*FLINK-14316 happens on version 1.13.2 >> >>> Hello Everyone, >>> We upgrade flink to 1.13.2, and we were facing randomly the "Job leader >>> ... lost leadership" error, the job keep restarting and failing... >>> It behaviours like this ticket >>> https://issues.apache.org/jira/browse/FLINK-14316 >>> >>> Did anybody had same issue or any suggestions? >>> >>> Best Regards, >>> >>> >>> -- >>> Xiangyu Su >>> Java Developer >>> xian...@smaato.com >>> >>> Smaato Inc. >>> San Francisco - New York - Hamburg - Singapore >>> www.smaato.com >>> >>> Germany: >>> >>> Barcastraße 5 >>> >>> 22087 Hamburg >>> >>> Germany >>> M 0049(176)43330282 >>> >>> The information contained in this communication may be CONFIDENTIAL and >>> is intended only for the use of the recipient(s) named above. If you are >>> not the intended recipient, you are hereby notified that any dissemination, >>> distribution, or copying of this communication, or any of its contents, is >>> strictly prohibited. If you have received this communication in error, >>> please notify the sender and delete/destroy the original message and any >>> copy of it from your computer or paper files. >>> >> > > -- > Xiangyu Su > Java Developer > xian...@smaato.com > > Smaato Inc. > San Francisco - New York - Hamburg - Singapore > www.smaato.com > > Germany: > > Barcastraße 5 > > 22087 Hamburg > > Germany > M 0049(176)43330282 > > The information contained in this communication may be CONFIDENTIAL and is > intended only for the use of the recipient(s) named above. If you are not > the intended recipient, you are hereby notified that any dissemination, > distribution, or copying of this communication, or any of its contents, is > strictly prohibited. If you have received this communication in error, > please notify the sender and delete/destroy the original message and any > copy of it from your computer or paper files. >