Re: Failures due to inevitable high backpressure

David Anderson Wed, 26 Aug 2020 10:42:14 -0700

You should begin by trying to identify the cause of the backpressure,
because the appropriate fix depends on the details.


Possible causes that I have seen include:

- the job is inadequately provisioned
- blocking i/o is being done in a user function
- a huge number of timers are firing simultaneously
- event time skew between different sources is causing large amounts of
state to be buffered
- data skew (a hot key) is overwhelming one subtask or slot
- external systems can't keep up (e.g., a sink)
- lengthy GC pauses caused by running lots of slots per TM with the
FsStateBackend
- contention for critical resources (e.g., using a NAS as the local disk
for RocksDB)

Unaligned checkpoints [1], new in Flink 1.11, should address this problem
in some cases, depending on the root cause. But first you should try to
figure out why you have high backpressure, because a number of the causes
listed above won't be helped by changing to unaligned checkpoints.

Best,
David

[1]
https://flink.apache.org/news/2020/07/06/release-1.11.0.html#unaligned-checkpoints-beta

On Wed, Aug 26, 2020 at 6:06 PM Hubert Chen <[email protected]> wrote:

> Hello,
>
> My Flink application has entered into a bad state and I was wondering if I
> could get some advice on how to resolve the issue.
>
> The sequence of events that led to a bad state:
>
> 1. A failure occurs (e.g., TM timeout) within the cluster
> 2. The application successfully recovers from the last completed checkpoint
> 3. The application consumes events from Kafka as quickly as it can. This
> introduces high backpressure.
> 4. A checkpoint is triggered
> 5. Another failure occurs (e.g., TM timeout, checkpoint timeout, Kafka
> transaction timeout) and the application loops back to step #2. This
> creates a vicious cycle where no progress is made.
>
> I believe the underlying issue is the application experiencing high
> backpressure. This can cause the TM to not respond to heartbeats or cause
> long checkpoint durations due to delayed processing of the checkpoint.
>
> I'm confused on the best next steps to take. How do I ensure that
> heartbeats and checkpoints successfully complete when experiencing
> inevitable high packpressure?
>
> Best,
> Hubert
>

Re: Failures due to inevitable high backpressure

Reply via email to