Hi,

@Joshua Fan <joshuafat...@gmail.com>, I think the logs you provided
substantiated my suspicion: you are running into FLINK-11843. It happens
the following way: after regaining leadership, the Dispatcher tries to
start the JobManager. Recovering the state of the job takes a while
(possibly also because of the temporary connection loss to ZooKeeper), so
the second revocation of leadership happens during the checkpoint
recovery. Because of this revocation, the start of the JobManager cannot
be completed. However, this operation needs to complete before the second
job recovery can start after regaining leadership for the second time.
This problem will be fixed by FLINK-11843, which will be part of Flink
1.10. The fix should be merged next week, so you could then try out
whether the problem still occurs.
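To make the race concrete, here is a toy sketch of the sequence (this is only an illustration of the interaction described above, not Flink's actual code; all class and method names are invented):

```python
# Toy model of the FLINK-11843 race: leadership is revoked while job
# recovery is still in flight, and the next leadership grant then waits
# on an operation that can never complete. Invented names; not Flink code.
class ToyDispatcher:
    def __init__(self):
        self.is_leader = False
        self.recovery_in_flight = False
        self.recovery_done = False

    def grant_leadership(self):
        if self.recovery_in_flight and not self.recovery_done:
            # second grant: blocked on the never-completing recovery
            return "stuck"
        self.is_leader = True
        self.recovery_in_flight = True  # kick off asynchronous job recovery
        return "recovering"

    def complete_recovery(self):
        self.recovery_done = True
        self.recovery_in_flight = False

    def revoke_leadership(self):
        self.is_leader = False
        # the bug: the in-flight recovery is neither completed nor
        # cancelled, so recovery_done can never become True

d = ToyDispatcher()
d.grant_leadership()          # first regain: recovery starts
d.revoke_leadership()         # revoked again before recovery finishes
state = d.grant_leadership()  # second regain
print(state)  # "stuck"
```

In the fixed version, revocation would complete or cancel the in-flight recovery, so the second grant could proceed.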

@Hanson, Bruce <bruce.han...@here.com>, the problem seems to be the
following: you start two JobManagers, which technically means that two
ClusterEntrypoints are started. Each ClusterEntrypoint can run a
Dispatcher, a ResourceManager, a RestServerEndpoint and potentially
multiple JobManagers. In order to figure out which component is active,
we do leader election. In your current setup, Dispatcher1 (running in the
first ClusterEntrypoint process) and ResourceManager2 (running in the
second ClusterEntrypoint process) gained leadership. Since Dispatcher1
can only talk to ResourceManager1, the cluster does not accept job
submissions and the Dispatcher cannot serve the cluster overview. This
problem has been fixed with Flink 1.8.0. Hence, I would suggest upgrading
to a newer Flink version where the problem should no longer occur.
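The resulting symptom can be sketched with a small fencing-token model (again a rough illustration with invented names, not Flink's actual classes): a component only accepts messages that carry its own fencing token, and a component that never became leader has no token at all, so everything sent to it is rejected:

```python
import uuid

# Rough model of Flink's fencing: a component rejects messages whose
# fencing token does not match its own. Invented names; not Flink code.
class ToyResourceManager:
    def __init__(self, process):
        self.process = process
        self.fencing_token = None  # set only if this RM becomes leader

    def request_resource_overview(self, token):
        if self.fencing_token is None or self.fencing_token != token:
            raise RuntimeError(
                f"FencingTokenException: fencing token not set on {self.process}"
            )
        return {"taskmanagers": 1}

rm1 = ToyResourceManager("entrypoint-1")
rm2 = ToyResourceManager("entrypoint-2")

# Leader election picks RM2, which receives a fencing token ...
leader_token = uuid.uuid4()
rm2.fencing_token = leader_token

# ... but the Dispatcher in entrypoint-1 is wired to RM1, whose token was
# never set, so every request it forwards is rejected.
try:
    rm1.request_resource_overview(leader_token)
except RuntimeError as err:
    print(err)
```

This matches the "Fencing token not set ... because the fencing token is null" message in the stack trace further down the thread.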

Cheers,
Till
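
P.S. For anyone who wants to reproduce the monitor-side behaviour in isolation: the 500s from /overview are easy to emulate with a stub HTTP server. Everything below (the handler, port, and URL) is invented for illustration; it is not Flink code, it merely stands in for a JobManager whose REST handler keeps failing:

```python
import http.server
import threading
import urllib.error
import urllib.request

class StubJobManager(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # mimic the broken JM: every /overview request fails with 500
        self.send_response(500)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep output quiet

def poll_overview(url):
    """Return the HTTP status of GET <url>, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

server = http.server.HTTPServer(("127.0.0.1", 0), StubJobManager)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status = poll_overview(f"http://127.0.0.1:{port}/overview")
print(status)  # 500
server.shutdown()
```

A healthy JobManager would answer 200 with a JSON cluster overview; a monitor should treat repeated 500s as a failed cluster, which is exactly what Bruce's monitor does.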

On Fri, Oct 11, 2019 at 6:18 AM Joshua Fan <joshuafat...@gmail.com> wrote:

> Sorry, I forgot the version: we run Flink 1.7 on YARN in HA mode.
>
> On Fri, Oct 11, 2019 at 12:02 PM Joshua Fan <joshuafat...@gmail.com>
> wrote:
>
>> Hi Till
>>
>> After getting your advice, I checked the log again. It seems not wholly
>> the same as the condition you mentioned.
>>
>> I would like to summarize the story in the log below.
>>
>> At one point, the ZooKeeper connection was not stable, so there were
>> three suspended-reconnected cycles.
>>
>> After the first suspended-reconnected cycle, the Minidispatcher tried to
>> recover all jobs.
>>
>> Then the second suspended-reconnected cycle came. After this reconnect,
>> a 'The heartbeat of JobManager with id
>> dbad79e0173c5658b029fba4d70e8084 timed out' occurred, and this time the
>> Minidispatcher didn't try to recover the job.
>>
>> Because the ZooKeeper connection did not recover, the third
>> suspended-reconnected cycle came. After ZooKeeper reconnected, the
>> Minidispatcher did not try to recover the job but just repeatedly threw
>> a FencingTokenException, and the AM was hanging. Our alarm system only
>> found that the job was gone but could not get a final state for it. The
>> FencingTokenException went on for nearly a whole day before we killed
>> the AM.
>>
>> The whole log is attached.
>>
>> Thanks
>>
>> Joshua
>>
>> On Fri, Oct 11, 2019 at 10:35 AM Hanson, Bruce <bruce.han...@here.com>
>> wrote:
>>
>>> Hi Till and Fabian,
>>>
>>>
>>>
>>> My apologies for taking a week to reply; it took some time to reproduce
>>> the issue with debug logging. I’ve attached logs from a two minute period
>>> when the problem happened. I’m just sending this to you two to avoid
>>> sending the log file all over the place. If you’d like to have our
>>> conversation in the user group mailing list, that’s fine.
>>>
>>>
>>>
>>> The job was submitted using the Job Manager REST API, starting at
>>> 20:33:46.262 and finishing at 20:34:01.547. This worked normally, and the
>>> job started running. We then run a monitor that polls the /overview
>>> endpoint of the JM REST API. It started polling at 20:34:31.380, which
>>> resulted in the JM throwing the FencingTokenException at 20:34:31.393 and
>>> returning a 500 to our monitor. This happens on every poll until the
>>> monitor times out, and then we tear down the cluster: even though the
>>> job is running, we can't tell that it is. This is somewhat rare,
>>> happening maybe 5% of the time.
>>>
>>>
>>>
>>> We’re running Flink 1.7.1. This issue only happens when we run in Job
>>> Manager High Availability mode. We provision two Job Managers, a 3-node
>>> zookeeper cluster, task managers and our monitor all in their own
>>> Kubernetes namespace. I can send you Zookeeper logs too if that would be
>>> helpful.
>>>
>>>
>>>
>>> Thanks in advance for any help you can provide!
>>>
>>>
>>>
>>> -Bruce
>>>
>>> --
>>>
>>>
>>>
>>>
>>>
>>> From: Till Rohrmann <trohrm...@apache.org>
>>> Date: Wednesday, October 2, 2019 at 6:10 AM
>>> To: Fabian Hueske <fhue...@gmail.com>
>>> Cc: "Hanson, Bruce" <bruce.han...@here.com>, "user@flink.apache.org" <
>>> user@flink.apache.org>
>>> Subject: Re: Fencing token exceptions from Job Manager High
>>> Availability mode
>>>
>>>
>>>
>>> Hi Bruce, are you able to provide us with the full debug logs? From the
>>> excerpt itself it is hard to tell what is going on.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Till
>>>
>>>
>>>
>>> On Wed, Oct 2, 2019 at 2:24 PM Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>> Hi Bruce,
>>>
>>>
>>>
>>> I haven't seen such an exception yet, but maybe Till (in CC) can help.
>>>
>>>
>>>
>>> Best,
>>>
>>> Fabian
>>>
>>>
>>>
>>> On Tue, Oct 1, 2019 at 5:51 AM, Hanson, Bruce <
>>> bruce.han...@here.com> wrote:
>>>
>>> Hi all,
>>>
>>>
>>>
>>> We are running some of our Flink jobs with Job Manager High
>>> Availability. Occasionally we get a cluster that comes up improperly and
>>> doesn’t respond. Attempts to submit the job seem to hang and when we hit
>>> the /overview REST endpoint in the Job Manager we get a 500 error and a
>>> fencing token exception like this:
>>>
>>>
>>>
>>> 2019-09-21 05:04:07.785 [flink-akka.actor.default-dispatcher-4428]
>>> level=ERROR o.a.f.runtime.rest.handler.cluster.ClusterOverviewHandler  -
>>> Implementation error: Unhandled exception.
>>> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing
>>> token not set: Ignoring message LocalFencedMessage(null,
>>> LocalRpcInvocation(requestResourceOverview(Time))) sent to
>>> akka.tcp://fl...@job-ef80a156-3350-4e85-8761-b0e42edc346f-jm-0.job-ef80a156-3350-4e85-8761-b0e42edc346f-jm-svc.olp-here-test-j-ef80a156-3350-4e85-8761-b0e42edc346f.svc.cluster.local:6126/user/resourcemanager
>>> because the fencing token is null.
>>>         at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59)
>>>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>>>         at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>>>         at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>
>>>
>>>
>>>
>>>
>>> We are running Flink 1.7.1 in Kubernetes and run each job in its own
>>> namespace with a three-node Zookeeper cluster and two Job Managers, plus
>>> one or more Task Managers. I have been able to replicate the issue, but
>>> don’t find any difference in the logs between a failing cluster and a good
>>> one.
>>>
>>>
>>>
>>> Does anyone here have any ideas as to what’s happening, or what I should
>>> be looking for?
>>>
>>>
>>>
>>> -Bruce
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Bruce Hanson
>>>
>>> Principal Engineer
>>>
>>> M: +1 425 681 0422
>>>
>>> HERE Technologies
>>>
>>> 701 Pike Street, Suite 2000
>>>
>>> Seattle, WA 98101 USA
>>>
>>> 47° 36' 41" N 122° 19' 57" W
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
