Hi Rob, yes, this behavior is expected. Flink does not automatically scale down a job when a failure occurs; you have to ensure that enough resources are available to continue processing. In Flink's cluster mode, the common practice is to keep standby TMs available (the same applies to JMs if you need an HA setup).
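For illustration, a minimal sketch of that setup (assuming a standalone cluster and the standard scripts shipped in the Flink distribution; the jar path is a placeholder):

```shell
# Start the cluster (JobManager plus the TaskManagers configured for it)
bin/start-cluster.sh

# Start one extra standby TaskManager on this host; with one slot per TM,
# a parallelism-5 job can then survive the loss of a single TM.
bin/taskmanager.sh start

# Submit the job with parallelism 5 (6 slots available, 5 in use)
bin/flink run -p 5 path/to/your-job.jar
```

With a restart strategy enabled, the job would then be restarted on the remaining slots after a TM is lost, instead of failing permanently.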
Best, Fabian

2017-10-06 13:56 GMT+02:00 r. r. <rob...@abv.bg>:
> Hello,
> I have set up a cluster and added task managers manually with
> bin/taskmanager.sh start.
> I noticed that if I have 5 task managers with one slot each and start a
> job with -p5, then if I stop a task manager the job will fail even if
> there are 4 more task managers.
>
> Is this expected (I turned off the restart policy)?
> So is the way to ensure continuous operation of a single "job" to have,
> e.g., 10 TMs and deploy 10 job instances to fill each of the 10 slots?
> Or if I have a job that requires -p3, for example, should I always have
> at least 3 TMs alive?
>
> Many thanks!
> -Rob
