Thank you for telling me about the cause. It recovered by restarting jobmanager-5 and jobmanager-1.
I restart jobmanager-1 because when I restarted jobmanager-5 , checkpointing started to fail with the following message. ---- 2016-07-28 10:42:28,217 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Failed to trigger checkpoint (19 consecutive failed attempts so far) org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /flink/flink_prod/checkpoint-counter/978ef000cca5a3aa6f3461274102f82c at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:274) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:270) at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:267) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:253) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:41) at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedValue.trySetValue(SharedValue.java:168) at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedCount.trySetCount(SharedCount.java:111) at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.getAndIncrement(ZooKeeperCheckpointIDCounter.java:121) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:411) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:339) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:928) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) ---- Anyway, thank you so much for your advice. I think it would be great if the fix will be backported to 1.0.4. Regards, Hionori 2016-07-28 0:08 GMT+09:00 Ufuk Celebi <u...@apache.org>: > Thanks for the logs. Looking through them it's caused by this issue: > https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph > (Flink's internal scheduling structure) is not terminated properly and > tries to restart the job over and over again. > > This is fixed for 1.1.0. Is it an option for you to upgrade to 1.1 > when it's out? We might need to backport this fix for 1.0.4. The work > around is as I've described, just restart jobmanager-5. > > > > On Wed, Jul 27, 2016 at 2:55 PM, Hironori Ogibayashi > <ogibaya...@gmail.com> wrote: >> Thank you so much for your quick response. >> I am running Flink 1.0.3. >> >> I have attached jobmanager logs. The failover happend around 7/26 21:13 JST. >> >> Regards, >> Hironori >> >> 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <u...@apache.org>: >>> Which version of Flink are you running on? I think this might have >>> been fixed for the 1.1 release >>> (http://people.apache.org/~uce/flink-1.1.0-rc1/). >>> >>> It looks like the ExecutionGraph is still trying to restart although >>> the JobManager is not the leader anymore. If you could provide the >>> complete logs of both JobManagers, that would be helpful to be sure >>> what is happening. >>> >>> You can recover from this by restarting the respective JobManager >>> process (by running "jobmanager.sh stop" script on that machine and >>> starting again via "jobmanager.sh start cluster") . >>> >>> – Ufuk >>> >>> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi >>> <ogibaya...@gmail.com> wrote: >>>> Hello, >>>> >>>> I have standalone Flink cluster with JobManager HA. >>>> Last night, JobManager failovered because of the connection timeout to >>>> Zookeeper. >>>> Job is successfully running under new leader JobManager, but when >>>> I see the old leader JobManager log, it is trying to re-submit job and >>>> getting errors. ( for almost 24 hours now) >>>> >>>> Here is the log. >>>> >>>> ----- >>>> 2016-07-27 20:56:09,218 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:09 Job execution switched to status RESTARTING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,218 INFO >>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>> - Recovering checkpoints from ZooKeeper. >>>> 2016-07-27 20:56:19,218 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status CREATED.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,219 INFO >>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>> - Found 1 checkpoints in ZooKeeper. >>>> 2016-07-27 20:56:19,221 INFO >>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>> - Initialized with Checkpoint 40229 @ 1469620528216 for >>>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints. >>>> 2016-07-27 20:56:19,222 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status RUNNING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,222 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >>>> switched from CREATED to SCHEDULED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >>>> switched from SCHEDULED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status FAILING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to >>>> CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to >>>> CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to >>>> CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> TriggerWindow(GlobalWindows(), >>>> FoldingStateDescriptor{name=window-contents, >>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> TriggerWindow(GlobalWindows(), >>>> FoldingStateDescriptor{name=window-contents, >>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> TriggerWindow(GlobalWindows(), >>>> FoldingStateDescriptor{name=window-contents, >>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy >>>> - Delaying retry of job execution for 10000 ms ... >>>> 2016-07-27 20:56:19,223 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status RESTARTING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> ---- >>>> >>>> Could anyone advise me why this happens and how I can recover from >>>> this situation? (restart JobManager?) >>>> >>>> Regards, >>>> Hironori Ogibayashi