Hi Sonam,

Flink uses its own heartbeat implementation to detect failures of
components. This mechanism is independent of the used deployment model. The
relevant configuration options can be found here [1].

The akka.transport.* options are only for configuring the underlying Akka
system. Since we are using TCP Akka's failure detector is not needed [2]. I
think we should remove it in order to avoid confusion [3].

The community also thinks about improving the failure detection mechanism
because in some deployment scenarios we have additional signals available
which could help us with the detection. But so far we haven't made a lot of
progress in this area.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html#advanced-fault-tolerance-options
[2]
https://doc.akka.io/docs/akka-enhancements/current/config-checker.html#transport-failure-detector
[3] https://issues.apache.org/jira/browse/FLINK-22048

Cheers,
Till

On Mon, Mar 29, 2021 at 11:01 PM Sonam Mandal <[email protected]> wrote:

> Hello,
>
> I'm looking for some resources around failure detection in Flink between
> the various components such as Task Manager, Job Manager, Resource Manager,
> etc. For example, how does the Job Manager detect that a Task Manager is
> down (long GC pause or it just crashed)?
>
> There is some indication of the use of heartbeats, is this via Akka death
> watches or custom heartbeat implementation? Reason I ask is because some
> configurations for timeout are AKKA related, whereas others aren't. I would
> like to understand which timeouts are relevant to which pieces.
>
> e.g. akka.transport.heartbeat.interval vs. heartbeat.interval
> I see some earlier posts that mention akka.watch.heartbeat.interval,
> though this is not present on the latest configuration page for Flink.
>
> Also, is this failure detection mechanism the same irrespective of the
> deployment model, i.e. Kubernetes/Yarn/Mesos?
>
> Thanks,
> Sonam
>
>
>

Reply via email to