Re: Flink Checkpoint times out with checkpointed data size doubles every checkpoint.

2023-06-20 Thread Prabhu Joseph
Thanks, Shammon and Alex, for the pointers. The RocksDB state backend is
being used, but without incremental checkpoints. I will enable incremental
checkpoints and see if it works. Thanks.


On Tue, Jun 20, 2023 at 5:25 PM Shammon FY wrote:

> Hi Prabhu,
>
> I noticed that `Full Checkpoint Data Size` is equal to
> `Checkpointed Data Size`. Which state backend are you using? I recommend
> using the RocksDB state backend for your job; with it, you can turn on
> incremental checkpoints [1], which will reduce the amount of state
> written for each checkpoint.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#incremental-checkpoints
>
> Best,
> Shammon FY
>
> On Tue, Jun 20, 2023 at 4:50 PM Alex Nitavsky wrote:
>
>> Hello Prabhu,
>>
>> In your place, I would check:
>>
>> 1. That there is no "state leak" in your job. It seems that state only
>> accumulates and is never cleaned up, e.g. because a timer that should
>> clear the state for some key is not configured correctly.
>>
>> 2. Whether you accumulate state in a large window, e.g. with a 2-hour
>> tumbling window the job state reaches its maximum only after two hours.
>> In that case your job should be scaled or optimized.
>>
>> Best
>> Alex
>>
>> On Tue, Jun 20, 2023 at 10:39 AM Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My Flink checkpoint times out, and the checkpointed data size doubles
>>> with every checkpoint. Any ideas on what could be wrong in the
>>> application, or how to debug this?
>>>
>>> [image: checkpoint_issue.png]
>>>
>>>
>>>


Re: Flink Checkpoint times out with checkpointed data size doubles every checkpoint.

2023-06-20 Thread Shammon FY
Hi Prabhu,

I noticed that `Full Checkpoint Data Size` is equal to
`Checkpointed Data Size`. Which state backend are you using? I recommend
using the RocksDB state backend for your job; with it, you can turn on
incremental checkpoints [1], which will reduce the amount of state
written for each checkpoint.

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#incremental-checkpoints
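
For reference, a minimal sketch of the configuration this suggestion
implies. It assumes the option names used by the Flink 1.x releases that
were current at the time of this thread (newer releases may use different
keys, so please check the documentation for your version). The checkpoint
directory is a hypothetical placeholder, not something taken from your job.

```yaml
# flink-conf.yaml (sketch; key names assumed from the Flink 1.x docs)
state.backend: rocksdb
# Upload only what changed since the previous checkpoint
state.backend.incremental: true
# Hypothetical location; replace with your own durable storage path
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
```

With this in place, `Checkpointed Data Size` should normally become smaller
than `Full Checkpoint Data Size`, unless most of the state changes between
checkpoints.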

Best,
Shammon FY

On Tue, Jun 20, 2023 at 4:50 PM Alex Nitavsky wrote:

> Hello Prabhu,
>
> In your place, I would check:
>
> 1. That there is no "state leak" in your job. It seems that state only
> accumulates and is never cleaned up, e.g. because a timer that should
> clear the state for some key is not configured correctly.
>
> 2. Whether you accumulate state in a large window, e.g. with a 2-hour
> tumbling window the job state reaches its maximum only after two hours.
> In that case your job should be scaled or optimized.
>
> Best
> Alex
>
> On Tue, Jun 20, 2023 at 10:39 AM Prabhu Joseph wrote:
>
>> Hi,
>>
>> My Flink checkpoint times out, and the checkpointed data size doubles
>> with every checkpoint. Any ideas on what could be wrong in the
>> application, or how to debug this?
>>
>> [image: checkpoint_issue.png]
>>
>>
>>


Re: Flink Checkpoint times out with checkpointed data size doubles every checkpoint.

2023-06-20 Thread Alex Nitavsky
Hello Prabhu,

In your place, I would check:

1. That there is no "state leak" in your job. It seems that state only
accumulates and is never cleaned up, e.g. because a timer that should
clear the state for some key is not configured correctly.

2. Whether you accumulate state in a large window, e.g. with a 2-hour
tumbling window the job state reaches its maximum only after two hours.
In that case your job should be scaled or optimized.
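
To make the difference between the two cases concrete, here is a toy model
in plain Java. It is not Flink code and does not measure a real job: the
class name, the method names, and the ingest rate of 1 MB per minute are
all made up for illustration. It only shows that state buffered in a
tumbling window stops growing once the first window has fired, while
leaked state keeps growing.

```java
import java.util.Arrays;

// Toy model, not Flink code: compares how much state a job holds over time
// in the two situations described above.
class StateGrowthSketch {

    // State held at the end of each minute when records are buffered in a
    // tumbling window that is emptied every `windowMinutes` minutes.
    static long[] tumblingWindowState(int totalMinutes, int windowMinutes, long bytesPerMinute) {
        long[] sizes = new long[totalMinutes];
        long held = 0;
        for (int minute = 0; minute < totalMinutes; minute++) {
            if (minute > 0 && minute % windowMinutes == 0) {
                held = 0; // the previous window fired and its state was dropped
            }
            held += bytesPerMinute;
            sizes[minute] = held;
        }
        return sizes;
    }

    // State held at the end of each minute when nothing is ever cleaned up.
    static long[] leakingState(int totalMinutes, long bytesPerMinute) {
        long[] sizes = new long[totalMinutes];
        long held = 0;
        for (int minute = 0; minute < totalMinutes; minute++) {
            held += bytesPerMinute;
            sizes[minute] = held;
        }
        return sizes;
    }

    // Largest value in the series, or 0 for an empty series.
    static long max(long[] values) {
        return Arrays.stream(values).max().orElse(0L);
    }

    public static void main(String[] args) {
        long bytesPerMinute = 1_000_000L; // hypothetical ingest rate
        long[] windowed = tumblingWindowState(360, 120, bytesPerMinute);
        long[] leaking = leakingState(360, bytesPerMinute);
        System.out.println("peak state, 2-hour tumbling window: " + max(windowed));
        System.out.println("state after 6 hours with a leak:    " + max(leaking));
    }
}
```

Over six simulated hours, the windowed case peaks at 120000000 bytes and
then repeats, while the leaking case reaches 360000000 bytes and is still
growing. If the checkpoint size in your job never levels off, case 1 is the
more likely explanation.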

Best
Alex

On Tue, Jun 20, 2023 at 10:39 AM Prabhu Joseph wrote:

> Hi,
>
> My Flink checkpoint times out, and the checkpointed data size doubles
> with every checkpoint. Any ideas on what could be wrong in the
> application, or how to debug this?
>
> [image: checkpoint_issue.png]
>
>
>


Flink Checkpoint times out with checkpointed data size doubles every checkpoint.

2023-06-20 Thread Prabhu Joseph
Hi,

My Flink checkpoint times out, and the checkpointed data size doubles
with every checkpoint. Any ideas on what could be wrong in the
application, or how to debug this?

[image: checkpoint_issue.png]