One other thought: some users experiencing this have found it preferable to
increase the checkpoint timeout to the point where it is effectively
infinite. Checkpoints that can't time out are likely to eventually complete,
which is better than landing in the vicious cycle you described.
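
For what it's worth, here is a minimal sketch of doing that programmatically
via the CheckpointConfig in a Java DataStream job (the one-minute interval is
just a placeholder):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

    // Checkpoint once a minute (placeholder interval).
    env.enableCheckpointing(60_000L);

    // Effectively infinite timeout: slow checkpoints are allowed to
    // finish instead of being cancelled and triggering yet another restart.
    env.getCheckpointConfig().setCheckpointTimeout(Long.MAX_VALUE);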

David

On Wed, Aug 26, 2020 at 7:41 PM David Anderson <da...@alpinegizmo.com>
wrote:

> You should begin by trying to identify the cause of the backpressure,
> because the appropriate fix depends on the details.
>
> Possible causes that I have seen include:
>
> - the job is inadequately provisioned
> - blocking i/o is being done in a user function
> - a huge number of timers are firing simultaneously (see the coalescing
> sketch after this list)
> - event time skew between different sources is causing large amounts of
> state to be buffered
> - data skew (a hot key) is overwhelming one subtask or slot
> - external systems can't keep up (e.g., a sink)
> - lengthy GC pauses caused by running lots of slots per TM with the
> FsStateBackend
> - contention for critical resources (e.g., using a NAS as the local disk
> for RocksDB)
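>
> For the timer case in particular, one common mitigation is to coalesce
> timers so that at most one fires per key per unit of time. A rough sketch
> inside a KeyedProcessFunction, where Event and Result are placeholder
> types and the one-second granularity is just an example:
>
>     @Override
>     public void processElement(Event value, Context ctx, Collector<Result> out) {
>         // Round up to the next full second so that many nearby registrations
>         // collapse into a single timer per key (Flink de-duplicates timers
>         // registered for the same key and timestamp).
>         long coalescedTime = ((ctx.timestamp() / 1000) + 1) * 1000;
>         ctx.timerService().registerEventTimeTimer(coalescedTime);
>     }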
>
> Unaligned checkpoints [1], new in Flink 1.11, should address this problem
> in some cases, depending on the root cause. But first you should try to
> figure out why you have high backpressure, because a number of the causes
> listed above won't be helped by changing to unaligned checkpoints.
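>
> If you do decide to experiment with them, enabling unaligned checkpoints
> is a one-liner on the CheckpointConfig (sketch, assuming checkpointing is
> already enabled in exactly-once mode, which unaligned checkpoints require):
>
>     env.getCheckpointConfig().enableUnalignedCheckpoints();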
>
> Best,
> David
>
> [1]
> https://flink.apache.org/news/2020/07/06/release-1.11.0.html#unaligned-checkpoints-beta
>
> On Wed, Aug 26, 2020 at 6:06 PM Hubert Chen <hubertche...@gmail.com>
> wrote:
>
>> Hello,
>>
>> My Flink application has entered a bad state and I was wondering if
>> I could get some advice on how to resolve the issue.
>>
>> The sequence of events that led to a bad state:
>>
>> 1. A failure occurs (e.g., TM timeout) within the cluster
>> 2. The application successfully recovers from the last completed
>> checkpoint
>> 3. The application consumes events from Kafka as quickly as it can. This
>> introduces high backpressure.
>> 4. A checkpoint is triggered
>> 5. Another failure occurs (e.g., TM timeout, checkpoint timeout, Kafka
>> transaction timeout) and the application loops back to step #2. This
>> creates a vicious cycle where no progress is made.
>>
>> I believe the underlying issue is that the application is experiencing
>> high backpressure. This can cause the TM to miss heartbeats, or lead to
>> long checkpoint durations because the checkpoint barriers are only
>> processed after a delay.
>>
>> I'm confused about the best next steps to take. How do I ensure that
>> heartbeats and checkpoints complete successfully when the application is
>> experiencing inevitable high backpressure?
>>
>> Best,
>> Hubert
>>
>
