Re: Flink Checkpoint times out with checkpointed data size doubles every checkpoint.
Thanks, Shammon and Alex, for the pointers. The RocksDB state backend is being used, but without incremental checkpoints. I will enable incremental checkpoints and see if that helps.

Thanks.

On Tue, Jun 20, 2023 at 5:25 PM Shammon FY wrote:
> [...]
Re: Flink Checkpoint times out with checkpointed data size doubles every checkpoint.
Hi Prabhu,

I noticed that `Full Checkpoint Data Size` is equal to `Checkpointed Data Size`, which suggests that every checkpoint is a full snapshot. Which state backend are you using? I recommend using the RocksDB state backend for your job. With it, you can turn on incremental checkpoints [1], so that each checkpoint uploads only the changes since the previous one instead of the full state.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#incremental-checkpoints

Best,
Shammon FY

On Tue, Jun 20, 2023 at 4:50 PM Alex Nitavsky wrote:
> [...]
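To illustrate why equal `Full Checkpoint Data Size` and `Checkpointed Data Size` values point to full snapshots, here is a toy Python model. It is a simplified sketch, not Flink's implementation. The hypothetical `ToyStateBackend` class treats state as a dict of per-key sizes and tracks which keys changed since the last checkpoint, whereas real RocksDB incremental checkpoints work at the level of RocksDB's SST files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyStateBackend:
    """Toy keyed-state store (a simplification, NOT Flink's implementation)."""
    state: Dict[str, int] = field(default_factory=dict)  # key -> entry size in bytes
    dirty: Set[str] = field(default_factory=set)         # keys changed since last checkpoint

    def put(self, key: str, size: int) -> None:
        self.state[key] = size
        self.dirty.add(key)

    def full_checkpoint(self) -> int:
        """Upload every entry: the checkpoint size equals the total state size."""
        self.dirty.clear()
        return sum(self.state.values())

    def incremental_checkpoint(self) -> int:
        """Upload only the entries changed since the previous checkpoint."""
        uploaded = sum(self.state[k] for k in self.dirty)
        self.dirty.clear()
        return uploaded


def simulate(mode: str, intervals: int = 3, new_keys_per_interval: int = 10) -> List[int]:
    """Return uploaded bytes per checkpoint when each interval adds new 100-byte keys."""
    backend = ToyStateBackend()
    sizes = []
    for i in range(intervals):
        for j in range(new_keys_per_interval):
            backend.put(f"key-{i}-{j}", 100)  # hypothetical 100-byte state entries
        if mode == "full":
            sizes.append(backend.full_checkpoint())
        elif mode == "incremental":
            sizes.append(backend.incremental_checkpoint())
        else:
            raise ValueError(f"unknown mode: {mode}")
    return sizes


if __name__ == "__main__":
    print(simulate("full"))         # [1000, 2000, 3000]
    print(simulate("incremental"))  # [1000, 1000, 1000]
```

In this model the incremental upload stays flat while the full snapshot keeps growing, but the total state grows in both cases. Incremental checkpoints reduce what is uploaded per checkpoint; they do not shrink the state itself.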
Re: Flink Checkpoint times out with checkpointed data size doubles every checkpoint.
Hello Prabhu,

In your place, I would check:

1. That there is no "state leak" in your job. It looks like state only accumulates and is never cleaned up; for example, a timer that should clear the state for some key may not be configured correctly.
2. Whether you accumulate state in a large window. For example, with a 2-hour tumbling window, the job's state reaches its maximum only after two hours. If so, the job should be scaled or optimized.

Best,
Alex

On Tue, Jun 20, 2023 at 10:39 AM Prabhu Joseph wrote:
> [...]
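Both patterns can be sketched with a toy Python model. The hypothetical functions below are illustrations, not Flink APIs. `live_state_sizes` keeps one state entry per key and, when `ttl` is set, uses a min-heap of cleanup "timers" to delete keys that have been idle for `ttl` time units. `tumbling_window_state` buffers every element until its window closes, assuming no incremental aggregation. In a real job, the counterparts would be timers registered in a `KeyedProcessFunction` or state TTL via `StateTtlConfig`.

```python
import heapq
from typing import Iterable, List, Optional, Tuple


def live_state_sizes(events: List[Tuple[int, str]], ttl: Optional[int] = None) -> List[int]:
    """Count keys held in state after each event (toy model, not Flink's API).

    events: (timestamp, key) pairs in ascending timestamp order.
    ttl: if set, a cleanup timer fires at last_seen + ttl and deletes the key;
         if None, state is never cleared, i.e. a state leak.
    """
    state = {}   # key -> timestamp at which the key was last seen
    timers = []  # min-heap of (fire_time, key)
    sizes = []
    for ts, key in events:
        # Fire every cleanup timer that is due at or before this event.
        while timers and timers[0][0] <= ts:
            fire_time, k = heapq.heappop(timers)
            # Skip stale timers: the key was refreshed after this timer was set.
            if k in state and state[k] + ttl == fire_time:
                del state[k]
        state[key] = ts
        if ttl is not None:
            heapq.heappush(timers, (ts + ttl, key))
        sizes.append(len(state))
    return sizes


def tumbling_window_state(timestamps: Iterable[int], window_size: int) -> List[int]:
    """Count elements buffered in a toy tumbling window after each event.

    Assumes all elements are kept until the window fires (no pre-aggregation)
    and that timestamps arrive in order.
    """
    buffered = 0
    current_window = None
    sizes = []
    for ts in timestamps:
        window_start = ts - ts % window_size
        if window_start != current_window:
            # The previous window has fired and its state has been cleared.
            buffered = 0
            current_window = window_start
        buffered += 1
        sizes.append(buffered)
    return sizes


if __name__ == "__main__":
    # Every event carries a new key, e.g. a hypothetical session ID.
    events = [(t, f"session-{t}") for t in range(10)]
    print(live_state_sizes(events))         # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(live_state_sizes(events, ttl=3))  # [1, 2, 3, 3, 3, 3, 3, 3, 3, 3]
    # One event every 10 minutes into a 2-hour (120-minute) tumbling window.
    print(max(tumbling_window_state(range(0, 240, 10), 120)))  # 12
```

Without cleanup, state grows by one entry per new key indefinitely; with cleanup timers it stays bounded by the keys active within the TTL. The window model is bounded too, but it peaks just before each window fires, so the size of a checkpoint depends on where in the window it is taken.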
Flink Checkpoint times out with checkpointed data size doubles every checkpoint.
Hi,

Flink checkpoints time out, and the checkpointed data size doubles with every checkpoint. Any ideas on what could be wrong in the application, or how to debug this?

[image: checkpoint_issue.png]