Thanks a lot for clarifying :-) - Harshith
From: Fabian Hueske <fhue...@gmail.com>
Date: Friday, 18 January 2019 at 4:31 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: [External] Re: Should the entire cluster be restarted if a single Task Manager crashes?

Hi Harshith,

No, you don't need to restart the whole cluster. Flink only needs enough processing slots to recover the job.
If you have a standby TM, the job should restart immediately (according to its restart policy). Otherwise, you have to start a new TM to provide more slots. Once the slots are registered, the job recovers.

Best, Fabian

On Fri, 18 Jan 2019 at 10:53, Kumar Bolar, Harshith <hk...@arity.com> wrote:

Hi all,

We're running a standalone Flink cluster with 2 Job Managers and 3 Task Managers. Whenever a TM crashes, we simply restart that particular TM and proceed with the processing.

But reading the comments on this question <https://stackoverflow.com/questions/54149134/what-happen-to-state-in-flink-task-manager-when-crash> makes it look like we need to restart all 5 nodes that form the cluster to deal with the failure of a single TM. Am I reading this right? What would be the consequences if we restart just the crashed TM and let the healthy ones run as is?

Thanks,
Harshith
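
Fabian's point about slots can be illustrated with a toy model. This is only a sketch of the slot arithmetic, not Flink code: the `TaskManager` class, the helper functions, and the figure of 4 slots per TM are assumptions made up for illustration, since the thread does not say how many slots each TM offers.

```python
from dataclasses import dataclass


@dataclass
class TaskManager:
    """Hypothetical stand-in for a TM; not a Flink class."""
    name: str
    slots: int
    alive: bool = True


def available_slots(tms):
    # Only slots on running TMs count towards recovery.
    return sum(tm.slots for tm in tms if tm.alive)


def can_recover(tms, required_slots):
    # The job can be rescheduled once enough slots are registered.
    return available_slots(tms) >= required_slots


# Assumed setup: 3 TMs with 4 slots each, and a job that uses all 12 slots.
cluster = [TaskManager("tm-1", 4), TaskManager("tm-2", 4), TaskManager("tm-3", 4)]
required = 12

cluster[2].alive = False                   # one TM crashes
after_crash = can_recover(cluster, required)

cluster.append(TaskManager("tm-3-restarted", 4))   # only the crashed TM is replaced
after_restart = can_recover(cluster, required)

print(after_crash, after_restart)
```

In this model the two healthy TMs are never touched: the job is blocked only while slots are missing, and it becomes recoverable as soon as the replacement TM registers its slots. With a standby TM already running, the crash alone would not drop the slot count below what the job needs.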