Hello,

I'm looking for some resources around failure detection in Flink between the 
various components such as Task Manager, Job Manager, Resource Manager, etc. 
For example, how does the Job Manager detect that a Task Manager is down (long 
GC pause or it just crashed)?

There is some indication of the use of heartbeats, is this via Akka death 
watches or custom heartbeat implementation? Reason I ask is because some 
configurations for timeout are AKKA related, whereas others aren't. I would 
like to understand which timeouts are relevant to which pieces.

e.g. akka.transport.heartbeat.interval vs. heartbeat.interval
I see some earlier posts that mention akka.watch.heartbeat.interval, though 
this is not present on the latest configuration page for Flink.

Also, is this failure detection mechanism the same irrespective of the 
deployment model, i.e. Kubernetes/Yarn/Mesos?

Thanks,
Sonam


Reply via email to