Hi James,

In this scenario, with a restart strategy set, the job should restart (even without YARN/Mesos) as long as enough slots are available.
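For reference, a restart strategy can be set cluster-wide in flink-conf.yaml. A minimal sketch of a fixed-delay strategy (key names from Flink's configuration documentation; the attempt count and delay here are only placeholders):

```yaml
# Restart the job up to 3 times, waiting 10 seconds between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

The same can be set per job via RestartStrategies on the execution environment.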
Can you check in the web interface at http://<jobmanager>:8081/ that enough slots are still available after killing one TaskManager? Can you also provide the JobManager and TaskManager logs and some more details on the job you are running?

Nico

On 16/01/18 07:04, Data Engineer wrote:
> This question has been asked on StackOverflow:
> https://stackoverflow.com/questions/48262080/how-to-get-automatic-fail-over-working-in-flink
>
> I am using Apache Flink 1.4 on a cluster of 3 machines, out of which one
> is the JobManager and the other 2 host TaskManagers.
>
> I start Flink in cluster mode and submit a Flink job. I have configured
> 24 task slots in the Flink config, and the job uses 6 task slots.
>
> When I submit the job, I see 3 tasks assigned to worker machine 1
> and 3 assigned to worker machine 2. Now, when I kill the TaskManager
> on worker machine 2, the entire job fails.
>
> Is this the expected behaviour, or is there automatic failover as in
> Spark?
>
> Do we need to use YARN/Mesos to achieve automatic failover?
>
> We tried the restart strategy, but when the job restarts we get an
> exception saying that no task slots are available, and then the job
> fails. We think that 24 slots should be enough to take over. What could
> we be doing wrong here?
>
> Regards,
> James
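Besides the dashboard, the slot counts can also be read from the JobManager's monitoring REST API. A small sketch, assuming the /overview endpoint and its "slots-available" field (as documented for Flink's REST API); the inline JSON below is a made-up sample standing in for the real response from `curl -s http://<jobmanager>:8081/overview`:

```shell
# Sample response in place of:  curl -s http://<jobmanager>:8081/overview
overview='{"taskmanagers":1,"slots-total":12,"slots-available":6}'

# Extract the "slots-available" count from the JSON with sed.
available=$(echo "$overview" | sed -n 's/.*"slots-available":\([0-9]*\).*/\1/p')
echo "available slots: $available"
```

After killing one TaskManager, the remaining available slots must still cover the job's parallelism (6 here) for the restart to succeed.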