Re: Incremental checkpointing & RocksDB Serialization

Yun Tang Fri, 04 Mar 2022 02:07:04 -0800

Hi Vidya,

> Why is the incremental checkpointing taking more time for the snapshot at the 
> end of the window duration?


I guess this is because the job is under back pressure at the end of the window. 
You can expand the checkpoint details to see whether the async duration of 
each task is much shorter than the end-to-end (e2e) duration. If so, most of 
the time went into the checkpoint barrier staying in the channel longer, not 
into the snapshot itself.
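
As a rough illustration of that comparison, here is a minimal sketch in Python. 
The numbers are made-up sample values, not taken from this job, and the names 
(SubtaskCheckpointStats, barrier_wait_ms, looks_back_pressured) and the 0.5 
threshold are hypothetical. It treats whatever is left of the e2e duration 
after subtracting the sync and async parts as time the barrier spent 
travelling or waiting, which is only a heuristic.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointStats:
    """Per-subtask durations as read off the checkpoint details view (sample data)."""
    name: str
    sync_ms: int
    async_ms: int
    end_to_end_ms: int


def barrier_wait_ms(stats: SubtaskCheckpointStats) -> int:
    """Approximate time not spent in the snapshot itself (heuristic, never negative)."""
    return max(0, stats.end_to_end_ms - stats.sync_ms - stats.async_ms)


def looks_back_pressured(stats: SubtaskCheckpointStats, ratio: float = 0.5) -> bool:
    """Flag a subtask when most of its e2e duration is outside the sync/async phases.

    The 0.5 ratio is an arbitrary assumption chosen for this example.
    """
    if stats.end_to_end_ms <= 0:
        return False
    return barrier_wait_ms(stats) / stats.end_to_end_ms >= ratio


# Hypothetical values: one subtask at the end of a window, one mid-window.
samples = [
    SubtaskCheckpointStats("window-op (0)", sync_ms=40, async_ms=1500, end_to_end_ms=60000),
    SubtaskCheckpointStats("window-op (1)", sync_ms=35, async_ms=1400, end_to_end_ms=2000),
]

for s in samples:
    print(s.name, barrier_wait_ms(s), looks_back_pressured(s))
```

A subtask flagged this way suggests looking at back pressure first, before 
tuning the snapshot upload itself.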

> Why is RocksDB serialization causing the CPU peak?

This is caused by the implementation of your serializer.

> Do you suggest any change in the serializer type in the RocksDB? (Kryo vs 
> Avro)

From our experience, Kryo is not a good choice in most cases.

Best
Yun Tang
________________________________
From: Vidya Sagar Mula <mulasa...@gmail.com>
Sent: Friday, March 4, 2022 17:00
To: user <user@flink.apache.org>
Subject: Incremental checkpointing & RocksDB Serialization

Hi,

I have a cluster running Flink 1.11 with an AWS S3 backend. I am trying 
incremental checkpointing on this setup. I have a pipeline with a 10-minute 
window, and incremental checkpointing happens every 2 minutes.

Observation:
-------------
I am observing a long duration when taking the snapshot at the end of each 
window, that is, for the last checkpoint of every window (almost every time).
I am attaching the checkpoint history from the Flink UI.

My setup details:
-------------------
Cluster: Cloud cluster with instance storage.
Memory : 20 GB
Heap : 10 GB
Flink Managed Memory: 4.5 GB
Flink Version : 1.11
CPUs : 2

ROCKSDB_WRITE_BUFFER_SIZE: "2097152000"  ## ~2 GB
ROCKSDB_BLOCK_CACHE_SIZE: "104857600"    ## 100 MB
ROCKSDB_BLOCK_SIZE: "5242880"            ## 5 MB
ROCKSDB_CHECKPOINT_TRANSFER_THREAD_NUM: 4
ROCKSDB_MAX_BACKGROUND_THREADS: 4
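
The byte values in the settings above can be sanity-checked with a few lines 
of Python. This is only a sketch: the helper name human_readable is 
hypothetical, and it uses binary units (1 KiB = 1024 bytes).

```python
def human_readable(num_bytes: int) -> str:
    """Convert a byte count to a string in binary units (B, KiB, MiB, GiB)."""
    units = ["B", "KiB", "MiB", "GiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# Values copied from the settings listed above.
settings = {
    "ROCKSDB_WRITE_BUFFER_SIZE": 2097152000,
    "ROCKSDB_BLOCK_CACHE_SIZE": 104857600,
    "ROCKSDB_BLOCK_SIZE": 5242880,
}

for key, size in settings.items():
    print(f"{key}: {human_readable(size)}")
```

With these inputs the write buffer comes out at 2000 MiB (about 1.95 GiB), 
slightly under a full 2 GiB, while the block cache and block size are exactly 
100 MiB and 5 MiB.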


In the analysis, I noticed that CPU utilization peaks at almost 100% at the 
time of the issue. Further analysis with thread dumps taken at the time of the 
CPU peak shows a RocksDB serialization-related call trace. All the thread 
samples point to this stack.

Based on the class type used in the pipeline transformation, the Kryo 
serializer is being chosen for the RocksDB state. I did try to change the 
serializer type, but that is not the focal point I want to stress here.

I would like to understand the reason for the high CPU utilization. I have 
tried increasing the number of CPUs to 2 and 4, but it did not give me any 
better results. I have parallelism 2.

Please take a look at the stack traces below. Could you suggest why 
serialization/deserialization in RocksDB is taking so much CPU?

########

Stack-1, Stack-2, Stack-3 are attached to this email.

Questions:
-----------
- Why is the incremental checkpointing taking more time for the snapshot at the 
end of the window duration?
- Why is RocksDB serialization causing the CPU peak?
- Do you suggest any change in the serializer type in the RocksDB? (Kryo vs 
Avro)

Thank you,
