Thanks, Shammon and Alex, for the pointers. The job is using the RocksDB state backend, but without incremental checkpoints. I will enable incremental checkpoints and see if that helps.
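
A minimal sketch of what enabling incremental checkpoints might look like in flink-conf.yaml. It assumes the configuration key names used around Flink 1.17; newer releases rename some of these keys, so check them against the documentation for the version in use.

```yaml
# Assumption: key names as of roughly Flink 1.17; later versions rename them.
# Use RocksDB as the state backend.
state.backend: rocksdb
# Upload only the changes since the previous checkpoint instead of the full state.
state.backend.incremental: true
```

With this set, `Checkpointed Data Size` should normally be smaller than `Full Checkpoint Data Size`. If the full size still keeps growing, the state itself is accumulating and the job logic needs a look.
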
On Tue, Jun 20, 2023 at 5:25 PM Shammon FY <zjur...@gmail.com> wrote:

> Hi Prabhu,
>
> I found that the size of `Full Checkpoint Data Size` is equal to
> `Checkpointed Data Size`. Which state backend are you using? I
> recommend using the RocksDB state backend for your job. If you do, you can
> turn on incremental checkpoints [1], which will reduce the size of each
> checkpoint.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#incremental-checkpoints
>
> Best,
> Shammon FY
>
> On Tue, Jun 20, 2023 at 4:50 PM Alex Nitavsky <alexnitav...@gmail.com>
> wrote:
>
>> Hello Prabhu,
>>
>> In your place I would check:
>>
>> 1. That there is no "state leak" in your job. It looks as if state only
>> accumulates and is never cleaned up, e.g. a timer that is supposed to
>> clean the state for some key may not be configured correctly.
>>
>> 2. Whether you accumulate state in a big window, e.g. with a 2-hour
>> tumbling window the job state reaches its maximum only after two hours.
>> In that case your job should be scaled or optimized.
>>
>> Best
>> Alex
>>
>> On Tue, Jun 20, 2023 at 10:39 AM Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Flink checkpoints time out, and the checkpointed data size doubles with
>>> every checkpoint. Any ideas on what could be wrong in the application or
>>> how to debug this?
>>>
>>> [image: checkpoint_issue.png]