Hi Dominik, all jobs on the cluster (batch only jobs without state) where in status FINISHED.
Best, Helmut On Fri, Aug 17, 2018 at 8:04 PM Dominik Wosiński <wos...@gmail.com> wrote: > I have faced this issue, but in 1.4.0 IIRC. This seems to be related to > https://issues.apache.org/jira/browse/FLINK-10011. What was the status of > the jobs when the main Job Manager has been stopped ? > > 2018-08-17 17:08 GMT+02:00 Helmut Zechmann <hel...@adeven.com>: > >> Hi all, >> >> we have a problem with flink 1.5.2 high availability in standalone mode. >> >> We have two jobmanagers running. When I shut down the main job manager, >> the failover job manager encounters an error during failover. >> >> Logs: >> >> >> 2018-08-17 14:38:16,478 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system [akka.tcp:// >> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50] >> ms. Reason: [Disassociated] >> 2018-08-17 14:38:31,449 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: >> seg-1.adjust.com/178.162.219.66:29095 >> 2018-08-17 <http://seg-1.adjust.com/178.162.219.66:29095%0D2018-08-17> >> 14:38:31,451 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system [akka.tcp:// >> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50] >> ms. Reason: [Association failed with [akka.tcp:// >> fl...@seg-1.adjust.com:29095]] Caused by: [Connection refused: >> seg-1.adjust.com/178.162.219.66:29095] >> 2018-08-17 14:38:41,379 ERROR >> org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler >> - Could not retrieve the redirect address. >> java.util.concurrent.CompletionException: >> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp:// >> fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [10000 >> ms]. Sender[null] sent message of type >> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". >> at >> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) >> [... shortened ...] >> Caused by: akka.pattern.AskTimeoutException: Ask timed out on >> [Actor[akka.tcp:// >> fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [10000 >> ms]. Sender[null] sent message of type >> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". >> at >> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) >> ... 9 more >> 2018-08-17 14:38:48,005 INFO >> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - >> http://seg-2.adjust.com:8083 was granted leadership with >> leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e >> 2018-08-17 14:38:48,005 INFO >> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - >> ResourceManager akka.tcp:// >> fl...@seg-2.adjust.com:30169/user/resourcemanager was granted leadership >> with fencing token 8de829de14876a367a80d37194b944ee >> 2018-08-17 14:38:48,006 INFO >> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - >> Starting the SlotManager. >> 2018-08-17 14:38:48,007 INFO >> org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher >> akka.tcp://fl...@seg-2.adjust.com:30169/user/dispatcher was granted >> leadership with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5 >> 2018-08-17 14:38:48,007 INFO >> org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering >> all persisted jobs. >> 2018-08-17 14:38:48,021 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null). >> 2018-08-17 14:38:48,028 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null). >> 2018-08-17 14:38:48,035 ERROR >> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error >> occurred in the cluster entrypoint. >> java.lang.RuntimeException: >> org.apache.flink.runtime.client.JobExecutionException: Could not set up >> JobManager >> at >> org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) >> [... shortened ...] >> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could >> not set up JobManager >> at >> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281) >> at >> org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38) >> ... 21 more >> Caused by: java.lang.Exception: Cannot set up the user code libraries: >> /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb >> (No such file or directory) >> at >> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:134) >> ... 25 more >> Caused by: java.io.FileNotFoundException: >> /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb >> (No such file or directory) >> at java.io.FileInputStream.open0(Native Method) >> [... shortened ...] >> ... 25 more >> 2018-08-17 14:38:48,036 INFO >> org.apache.flink.runtime.blob.TransientBlobCache - Shutting >> down BLOB cache >> 2018-08-17 14:38:48,038 INFO org.apache.flink.runtime.blob.BlobServer >> - Stopped BLOB server at 0.0.0.0:27073 >> >> >> Our HA config is: >> >> >> high-availability: zookeeper >> high-availability.cluster-id: 1.5-batch >> high-availability.storageDir: >> file:///var/lib/flink/ceph//prod/1.5-batch/ha_state >> high-availability.zookeeper.path.root: /1.5-batch >> high-availability.zookeeper.quorum: kafka-4:2181,kafka-5:2181,kafka-6:2181 >> >> >> Any ideas what might be the probleme here? >> >> >> Best, >> >> Helmut > > >