Hi James,

did you configure checkpointing [1] and a restart strategy [2] for your job?
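If not, here is a minimal sketch of the relevant flink-conf.yaml entries for a fixed-delay restart strategy and a filesystem checkpoint state backend. The attempt count, delay, and HDFS path below are just example values, and note that the checkpoint interval itself is enabled per job in code via env.enableCheckpointing(...):

```yaml
# flink-conf.yaml (Flink 1.4) -- example values, adjust to your setup

# restart the job up to 3 times, waiting 10 s between attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# where completed checkpoints are stored so a restarted job can recover
# (hdfs:///flink/checkpoints is a placeholder path)
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
```

Without a restart strategy (or with its attempts exhausted), a TaskManager loss fails the job permanently instead of triggering recovery.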
Best,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/checkpointing.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/restart_strategies.html

2018-01-17 8:10 GMT+01:00 Data Engineer <dataenginee...@gmail.com>:

> Hi Nico,
>
> Thank you for your reply.
>
> I have configured each TaskManager with 24 available task slots. When both
> TaskManagers were running, I could see that a total of 8 task slots were
> being used.
> I also see that 24 task slots are available after one TaskManager goes
> down. I don't see any exception regarding available task slots.
>
> However, I get a java.net.ConnectException each time the JobManager tries
> to connect to the TaskManager that I have killed. It retries 3 times (the
> number I have set) and then the job fails.
> I expect the JobManager to move the workload to the remaining machine on
> which a TaskManager is running. Or does it expect both TaskManagers to be
> up by the time it restarts?
>
> Regards,
> James
>
> On Tue, Jan 16, 2018 at 3:02 PM, Nico Kruber <n...@data-artisans.com> wrote:
>
>> Hi James,
>> In this scenario, with the restart strategy set, the job should restart
>> (without YARN/Mesos) as long as you have enough slots available.
>>
>> Can you check in the web interface at http://<jobmanager>:8081/ that
>> enough slots are available after killing one TaskManager?
>>
>> Can you provide the JobManager and TaskManager logs and some more details
>> on the job you are running?
>>
>> Nico
>>
>> On 16/01/18 07:04, Data Engineer wrote:
>> > This question has been asked on StackOverflow:
>> > https://stackoverflow.com/questions/48262080/how-to-get-automatic-fail-over-working-in-flink
>> >
>> > I am using Apache Flink 1.4 on a cluster of 3 machines, of which one
>> > is the JobManager and the other 2 host TaskManagers.
>> >
>> > I start Flink in cluster mode and submit a Flink job. I have configured
>> > 24 task slots in the Flink config, and for the job I use 6 task slots.
>> >
>> > When I submit the job, I see 3 tasks assigned to worker machine 1
>> > and 3 assigned to worker machine 2. Now, when I kill the TaskManager
>> > on worker machine 2, I see that the entire job fails.
>> >
>> > Is this the expected behaviour, or does it have automatic failover as
>> > Spark does?
>> >
>> > Do we need to use YARN/Mesos to achieve automatic failover?
>> >
>> > We tried the restart strategy, but when it restarts we get an exception
>> > saying that no task slots are available, and then the job fails. We
>> > think that 24 slots should be enough to take over. What could we be
>> > doing wrong here?
>> >
>> > Regards,
>> > James