Greetings Paimon community and devs,

I’d like to share some findings from recent performance testing of Paimon
in Flink, on highly partitioned tables with large data volumes.

For PK tables with thousands of partitions in fixed-bucket mode, where all
partitions are receiving streaming writes, we have observed that:
- manifest files are fetched from the cloud filesystem excessively, on the
order of 100,000s of times or more
- manifest fetching can be computationally intensive due to decompression and
deserialization, leaving Flink TaskManager CPU utilization pinned at 100%
until all manifest fetching is complete
- this causes very long run times for dedicated compaction jobs, unstable
write performance for streaming jobs, and very high API request costs to the
cloud filesystem for all the manifest file retrievals

After a lot of logging and debugging, the bottleneck appears to be
FileSystemWriteRestore.restoreFiles
<https://github.com/apache/paimon/blob/release-1.3/paimon-core/src/main/java/org/apache/paimon/operation/FileSystemWriteRestore.java#L80>,
which re-fetches the manifests for every (partition, bucket) combination and
then filters down to only the data files relevant to that particular
(partition, bucket) slice.
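
To give a rough sense of the scale (the numbers below are made up for
illustration, not taken from our workload), if every (partition, bucket)
slice re-reads all manifest files of the snapshot, the number of remote
manifest reads grows multiplicatively:

    // Self-contained illustration only, not Paimon code: counts manifest reads
    // when every (partition, bucket) slice re-reads the snapshot's manifests.
    public class ManifestFetchCount {
        public static void main(String[] args) {
            int partitions = 2000;       // assumed partition count
            int bucketsPerPartition = 4; // assumed fixed bucket count
            int manifestFiles = 50;      // assumed manifest files in the snapshot
            long remoteReads = 0;

            // Every (partition, bucket) slice scans all manifest files of the
            // snapshot and filters down to its own entries.
            for (int slice = 0; slice < partitions * bucketsPerPartition; slice++) {
                remoteReads += manifestFiles; // one remote fetch + decompress + deserialize each
            }

            // Prints 400000, i.e. well into the 100,000s even for a modest setup.
            System.out.println("manifest reads without caching: " + remoteReads);
        }
    }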

We have been testing a patch to FileSystemWriteRestore that pre-fetches and
caches the entire manifest once, avoiding duplicated API requests and
reducing the computational burden of repeated
decompression/deserialization.
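
For concreteness, here is a minimal sketch of the caching idea (class and
method names below are placeholders, not the actual patch or the Paimon
API): each manifest file is fetched, decompressed and deserialized at most
once, and every (partition, bucket) restore then filters the cached entries
in memory.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch only; ManifestEntry and readAndDeserialize are placeholders.
    public class ManifestCacheSketch {

        // Cached entries keyed by manifest file name, shared across (partition, bucket) restores.
        private final Map<String, List<ManifestEntry>> cache = new ConcurrentHashMap<>();

        // Returns the entries of one manifest file, reading it from the filesystem
        // only on first access; subsequent callers reuse the deserialized result.
        public List<ManifestEntry> entries(String manifestFileName) {
            return cache.computeIfAbsent(manifestFileName, this::readAndDeserialize);
        }

        // Placeholder for the expensive path: remote GET + decompression + deserialization.
        private List<ManifestEntry> readAndDeserialize(String manifestFileName) {
            throw new UnsupportedOperationException("illustrative placeholder");
        }

        // Minimal placeholder for a manifest entry; the real class also carries the
        // partition, bucket and data file metadata used for filtering.
        public static final class ManifestEntry {
            private final String partition;
            private final int bucket;
            private final String dataFilePath;

            public ManifestEntry(String partition, int bucket, String dataFilePath) {
                this.partition = partition;
                this.bucket = bucket;
                this.dataFilePath = dataFilePath;
            }

            public String partition() { return partition; }
            public int bucket() { return bucket; }
            public String dataFilePath() { return dataFilePath; }
        }
    }

A production version would of course need to bound the cache's memory
footprint and tie its lifetime to the restore phase; the draft PR linked
below contains the actual implementation.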

Draft PR for discussion: https://github.com/apache/paimon/pull/7031
GitHub issue with further details:
https://github.com/apache/paimon/issues/7030

I'd like to get some feedback from Paimon maintainers/devs on:
- whether this is an acceptable approach, or suggestions for alternative
implementation approaches
- whether there are any caveats/issues this might cause (e.g. any risk of
data loss?)

Many thanks,
Mao
