Hi Harshith,

No, you don't need to restart the whole cluster. Flink only needs enough free processing slots to recover the job. If you have a standby TM with free slots, the job should restart immediately (subject to its restart strategy). Otherwise, you have to start a new TM to provide the missing slots. Once those slots are registered with the cluster, the job recovers.
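For reference, in a standalone cluster a default restart strategy can be set cluster-wide in flink-conf.yaml. This is a minimal sketch, assuming a Flink 1.7-era setup; the attempt count and delay are illustrative values, not recommendations:

```yaml
# flink-conf.yaml (illustrative values; adjust for your job)
# Restart a failed job up to 3 times, waiting 10 s between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

A restart strategy set on the job itself through its execution environment takes precedence over this cluster default.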
Best,
Fabian

On Fri, 18 Jan 2019 at 10:53, Kumar Bolar, Harshith <hk...@arity.com> wrote:
> Hi all,
>
> We're running a standalone Flink cluster with 2 Job Managers and 3 Task
> Managers. Whenever a TM crashes, we simply restart that particular TM and
> proceed with the processing.
>
> But reading the comments on this
> <https://stackoverflow.com/questions/54149134/what-happen-to-state-in-flink-task-manager-when-crash>
> question makes it look like we need to restart all the 5 nodes that form a
> cluster to deal with the failure of a single TM. Am I reading this right?
> What would be the consequences if we restart just the crashed TM and let the
> healthy ones run as is?
>
> Thanks,
> Harshith