Hi Till,

Thanks, this helps! Yes, removing the AKKA related configs will definitely help 
to reduce confusion.

One more question, I was going through FLIP-6 and it does talk about the 
behavior of various components when failures are detected via heartbeat 
timeouts etc. is this the best reference on how Flink reacts to such failure 
scenarios? If not, can you provide some details on how this works?

Thanks,
Sonam

Get Outlook for iOS<https://aka.ms/o0ukef>
________________________________
From: Till Rohrmann <[email protected]>
Sent: Tuesday, March 30, 2021 5:02:43 AM
To: Sonam Mandal <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Failure detection in Flink

Hi Sonam,

Flink uses its own heartbeat implementation to detect failures of components. 
This mechanism is independent of the used deployment model. The relevant 
configuration options can be found here [1].

The akka.transport.* options are only for configuring the underlying Akka 
system. Since we are using TCP Akka's failure detector is not needed [2]. I 
think we should remove it in order to avoid confusion [3].

The community also thinks about improving the failure detection mechanism 
because in some deployment scenarios we have additional signals available which 
could help us with the detection. But so far we haven't made a lot of progress 
in this area.

[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html#advanced-fault-tolerance-options<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-stable%2Fdeployment%2Fconfig.html%23advanced-fault-tolerance-options&data=04%7C01%7Csomandal%40linkedin.com%7Ceca3936cc7ad4027255c08d8f373c55a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637527025850803513%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=dtAezfRlQqrqrR1Ukw8PI71whGFeOzIn82b1RJrmcM0%3D&reserved=0>
[2] 
https://doc.akka.io/docs/akka-enhancements/current/config-checker.html#transport-failure-detector<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdoc.akka.io%2Fdocs%2Fakka-enhancements%2Fcurrent%2Fconfig-checker.html%23transport-failure-detector&data=04%7C01%7Csomandal%40linkedin.com%7Ceca3936cc7ad4027255c08d8f373c55a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637527025850813507%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=ZgNg1wjoy9f2%2BYjVcPY9obdt%2BnSMf4udt9BO8FN1c80%3D&reserved=0>
[3] 
https://issues.apache.org/jira/browse/FLINK-22048<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FFLINK-22048&data=04%7C01%7Csomandal%40linkedin.com%7Ceca3936cc7ad4027255c08d8f373c55a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637527025850813507%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=LloCATLNcLBOvKnwV2STgCJLCJf42iC5phjpVOBvJRk%3D&reserved=0>

Cheers,
Till

On Mon, Mar 29, 2021 at 11:01 PM Sonam Mandal 
<[email protected]<mailto:[email protected]>> wrote:
Hello,

I'm looking for some resources around failure detection in Flink between the 
various components such as Task Manager, Job Manager, Resource Manager, etc. 
For example, how does the Job Manager detect that a Task Manager is down (long 
GC pause or it just crashed)?

There is some indication of the use of heartbeats, is this via Akka death 
watches or custom heartbeat implementation? Reason I ask is because some 
configurations for timeout are AKKA related, whereas others aren't. I would 
like to understand which timeouts are relevant to which pieces.

e.g. akka.transport.heartbeat.interval vs. heartbeat.interval
I see some earlier posts that mention akka.watch.heartbeat.interval, though 
this is not present on the latest configuration page for Flink.

Also, is this failure detection mechanism the same irrespective of the 
deployment model, i.e. Kubernetes/Yarn/Mesos?

Thanks,
Sonam


Reply via email to