You should begin by trying to identify the cause of the backpressure, because the appropriate fix depends on the details.
Possible causes that I have seen include: - the job is inadequately provisioned - blocking i/o is being done in a user function - a huge number of timers are firing simultaneously - event time skew between different sources is causing large amounts of state to be buffered - data skew (a hot key) is overwhelming one subtask or slot - external systems can't keep up (e.g., a sink) - lengthy GC pauses caused by running lots of slots per TM with the FsStateBackend - contention for critical resources (e.g., using a NAS as the local disk for RocksDB) Unaligned checkpoints [1], new in Flink 1.11, should address this problem in some cases, depending on the root cause. But first you should try to figure out why you have high backpressure, because a number of the causes listed above won't be helped by changing to unaligned checkpoints. Best, David [1] https://flink.apache.org/news/2020/07/06/release-1.11.0.html#unaligned-checkpoints-beta On Wed, Aug 26, 2020 at 6:06 PM Hubert Chen <hubertche...@gmail.com> wrote: > Hello, > > My Flink application has entered into a bad state and I was wondering if I > could get some advice on how to resolve the issue. > > The sequence of events that led to a bad state: > > 1. A failure occurs (e.g., TM timeout) within the cluster > 2. The application successfully recovers from the last completed checkpoint > 3. The application consumes events from Kafka as quickly as it can. This > introduces high backpressure. > 4. A checkpoint is triggered > 5. Another failure occurs (e.g., TM timeout, checkpoint timeout, Kafka > transaction timeout) and the application loops back to step #2. This > creates a vicious cycle where no progress is made. > > I believe the underlying issue is the application experiencing high > backpressure. This can cause the TM to not respond to heartbeats or cause > long checkpoint durations due to delayed processing of the checkpoint. > > I'm confused on the best next steps to take. How do I ensure that > heartbeats and checkpoints successfully complete when experiencing > inevitable high packpressure? > > Best, > Hubert >