Hi
From the given pictures,
1. There were some failed checkpoints (not because of timeout); could you
please check why these checkpoints failed?
2. The checkpoint data size shown is the delta size of the current
checkpoint [1], assuming you are using incremental checkpoints.
3. In fig 1 the checkpoint size is ~3 GB, but in fig 2 the delta size can grow
to ~15 GB. My gut feeling is that the state update/insert ratio of your
program is very high, so that a single checkpoint generates too many
SST files.
4. From fig 2 it seems you have configured
execution-checkpointing-max-concurrent-checkpoints [2] to a value bigger
than 1; could you please set it to 1 and try again? (A small configuration
sketch follows the links below.)

[1] https://ci.apache.org/projects/flink/flink-docs-master/monitoring/checkpoint_monitoring.html#history-tab
[2] https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#execution-checkpointing-max-concurrent-checkpoints
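
In case it helps, here is a minimal sketch (Java DataStream API) of how an
incremental RocksDB backend plus a single concurrent checkpoint could be
configured. The HDFS path, the 60s interval, and the class name are
placeholders, not taken from your setup:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Requires the flink-statebackend-rocksdb dependency on the classpath.
public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB backend with incremental checkpoints enabled (second argument),
        // so each checkpoint only uploads newly created/changed SST files.
        // The HDFS path is a placeholder -- use your own checkpoint directory.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Checkpoint every 60s (example interval) and allow only one
        // checkpoint in flight at a time, as suggested in point 4.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // ... build the rest of the job here, then call env.execute(...)
    }
}
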
Best,
Congxian


On Sat, May 30, 2020 at 7:43 AM Slotterback, Chris <chris_slotterb...@comcast.com> wrote:

> Hi there,
>
> We are trying to upgrade a Flink app from FsStateBackend to
> RocksDBStateBackend to reduce overhead memory requirements. When enabling
> RocksDB, we are seeing a drop in used heap memory as state spills to disk,
> but checkpoint durations have become inconsistent. Our data source has a
> stable rate of reports arriving in parallel across partitions. The state
> size doesn’t seem to correlate with the checkpoint duration from what I can
> see in the metrics. We have tried tmpfs and swap on SSDs with high IOPS, but
> can’t get a good handle on what’s causing smaller state to take longer to
> checkpoint. Our checkpoint location is HDFS, and it works well in our
> non-RocksDB cluster.
>
> Is a ~100x checkpoint duration expected when going from the filesystem to the
> RocksDB state backend, and is checkpoint duration normally supposed to vary
> this much with a consistent data source?
>
> Chris
>
