Re: JobManager trying to re-submit jobs after failover

Hironori Ogibayashi Wed, 27 Jul 2016 20:32:16 -0700

Thank you for telling me about the cause.
The cluster recovered after I restarted jobmanager-5 and jobmanager-1.


I restarted jobmanager-1 as well because, after I restarted jobmanager-5,
checkpointing started to fail with the following message.

----
2016-07-28 10:42:28,217 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Failed to trigger checkpoint (19 consecutive failed attempts so far)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /flink/flink_prod/checkpoint-counter/978ef000cca5a3aa6f3461274102f82c
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:274)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:270)
        at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:267)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:253)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:41)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedValue.trySetValue(SharedValue.java:168)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedCount.trySetCount(SharedCount.java:111)
        at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.getAndIncrement(ZooKeeperCheckpointIDCounter.java:121)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:411)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:339)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:928)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
----

Anyway, thank you so much for your advice.
It would be great if the fix could be backported to 1.0.4.
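
For anyone else who runs into this, here is a minimal sketch for spotting
the same restart loop in an old leader's JobManager log. It is not part of
Flink. It only looks for the "Discard message LeaderSessionMessage(...)"
warnings quoted further down in this thread and counts which job statuses
were discarded. The function names and the threshold of 2 are assumptions
made for illustration.

```python
import re
from collections import Counter

# Matches the "Discard message" warning from the JobManager log quoted in this
# thread. Whitespace is matched loosely (\s+) so the pattern also works on
# log text that a mail client has re-wrapped across lines.
DISCARD = re.compile(
    r"Discard\s+message\s+LeaderSessionMessage\("
    r"(?P<session>[0-9a-f-]{36}),"
    r".*?Job\s+execution\s+switched\s+to\s+status\s+(?P<status>[A-Z]+)\.\)"
    r"\s+because\s+the\s+expected\s+leader\s+session\s+ID\s+None",
    re.DOTALL,
)


def count_discarded_transitions(log_text):
    """Count discarded job status messages, keyed by status name."""
    return Counter(m.group("status") for m in DISCARD.finditer(log_text))


def looks_like_restart_loop(log_text, min_restarts=2):
    """Heuristic (assumed threshold): a JobManager with no leader session
    that repeatedly moves the job to RESTARTING is probably stuck."""
    return count_discarded_transitions(log_text)["RESTARTING"] >= min_restarts
```

To use it, read the log of the former leader into a string and pass it to
looks_like_restart_loop(). If it returns True, the workaround from this
thread (restarting that JobManager process) is worth trying.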

Regards,
Hironori

2016-07-28 0:08 GMT+09:00 Ufuk Celebi <u...@apache.org>:
> Thanks for the logs. Looking through them it's caused by this issue:
> https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph
> (Flink's internal scheduling structure) is not terminated properly and
> tries to restart the job over and over again.
>
> This is fixed for 1.1.0. Is it an option for you to upgrade to 1.1
> when it's out? We might need to backport this fix for 1.0.4. The
> workaround is as I've described: just restart jobmanager-5.
>
>
>
> On Wed, Jul 27, 2016 at 2:55 PM, Hironori Ogibayashi
> <ogibaya...@gmail.com> wrote:
>> Thank you so much for your quick response.
>> I am running Flink 1.0.3.
>>
>> I have attached jobmanager logs. The failover happened around 7/26 21:13 JST.
>>
>> Regards,
>> Hironori
>>
>> 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <u...@apache.org>:
>>> Which version of Flink are you running on? I think this might have
>>> been fixed for the 1.1 release
>>> (http://people.apache.org/~uce/flink-1.1.0-rc1/).
>>>
>>> It looks like the ExecutionGraph is still trying to restart although
>>> the JobManager is not the leader anymore. If you could provide the
>>> complete logs of both JobManagers, that would be helpful to be sure
>>> what is happening.
>>>
>>> You can recover from this by restarting the respective JobManager
>>> process (by running the "jobmanager.sh stop" script on that machine and
>>> starting again via "jobmanager.sh start cluster").
>>>
>>> – Ufuk
>>>
>>> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi
>>> <ogibaya...@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> I have a standalone Flink cluster with JobManager HA.
>>>> Last night, the JobManager failed over because of a connection timeout to
>>>> ZooKeeper.
>>>> The job is running successfully under the new leader JobManager, but the
>>>> old leader's log shows that it keeps trying to re-submit the job and
>>>> getting errors (for almost 24 hours now).
>>>>
>>>> Here is the log.
>>>>
>>>> -----
>>>> 2016-07-27 20:56:09,218 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager                -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:09     Job execution switched to status RESTARTING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,218 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>> - Recovering checkpoints from ZooKeeper.
>>>> 2016-07-27 20:56:19,218 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager                -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19     Job execution switched to status CREATED.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,219 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>> - Found 1 checkpoints in ZooKeeper.
>>>> 2016-07-27 20:56:19,221 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>> - Initialized with Checkpoint 40229 @ 1469620528216 for
>>>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
>>>> 2016-07-27 20:56:19,222 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager                -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19     Job execution switched to status RUNNING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,222 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>>>> switched from CREATED to SCHEDULED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>>>> switched from SCHEDULED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager                -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19     Job execution switched to status FAILING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
>>>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
>>>> CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
>>>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
>>>> CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
>>>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
>>>> CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> TriggerWindow(GlobalWindows(),
>>>> FoldingStateDescriptor{name=window-contents,
>>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> TriggerWindow(GlobalWindows(),
>>>> FoldingStateDescriptor{name=window-contents,
>>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> TriggerWindow(GlobalWindows(),
>>>> FoldingStateDescriptor{name=window-contents,
>>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
>>>>  - Delaying retry of job execution for 10000 ms ...
>>>> 2016-07-27 20:56:19,223 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager                -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19     Job execution switched to status RESTARTING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> ----
>>>>
>>>> Could anyone advise me why this happens and how I can recover from
>>>> this situation? (restart JobManager?)
>>>>
>>>> Regards,
>>>> Hironori Ogibayashi
