Thanks a lot for clarifying :-)

- Harshith

From: Fabian Hueske <fhue...@gmail.com>
Date: Friday, 18 January 2019 at 4:31 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: [External] Re: Should the entire cluster be restarted if a single Task 
Manager crashes?

Hi Harshith,

No, you don't need to restart the whole cluster. Flink only needs enough 
processing slots to recover the job.
If you have a standby TM, the job should restart immediately (according to its 
restart strategy). Otherwise, you have to start a new TM to provide more slots. 
Once the slots are registered, the job recovers.
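
As a rough illustration of the slot arithmetic described above, here is a toy 
Python model. It is not Flink code: the TaskManager class, the helper 
functions, the slot counts, and the job's slot requirement are all made up for 
the example. It only shows that recovery depends on whether the registered 
slots cover what the job needs, not on restarting the healthy nodes.

```python
from dataclasses import dataclass


@dataclass
class TaskManager:
    # Hypothetical stand-in for a TM; not a Flink class.
    name: str
    slots: int
    alive: bool = True


def available_slots(task_managers):
    """Total slots offered by TaskManagers that are still registered."""
    return sum(tm.slots for tm in task_managers if tm.alive)


def can_recover(task_managers, required_slots):
    """True if the registered TaskManagers offer enough slots for the job."""
    return available_slots(task_managers) >= required_slots


# Assumed setup: 3 TaskManagers with 4 slots each, and a job that needs 12 slots.
cluster = [TaskManager("tm-1", 4), TaskManager("tm-2", 4), TaskManager("tm-3", 4)]
required = 12

cluster[2].alive = False                  # tm-3 crashes
print(can_recover(cluster, required))     # False: only 8 slots are left

cluster.append(TaskManager("tm-4", 4))    # start a replacement TM
print(can_recover(cluster, required))     # True: 12 slots are registered again
```

In this sketch tm-1 and tm-2 are never touched; only the missing slots are 
replaced, which mirrors restarting just the crashed TM.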

Best,
Fabian

On Fri, 18 Jan 2019 at 10:53, Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi all,

We're running a standalone Flink cluster with 2 Job Managers and 3 Task 
Managers. Whenever a TM crashes, we simply restart that particular TM and 
proceed with the processing.

But reading the comments on this question 
<https://stackoverflow.com/questions/54149134/what-happen-to-state-in-flink-task-manager-when-crash> 
makes it look like we need to restart all 5 nodes in the cluster to deal with 
the failure of a single TM. Am I reading this right? What would be the 
consequences if we restart just the crashed TM and let the healthy ones run as 
is?

Thanks,
Harshith
