Greetings Paimon community and devs,

I'd like to share some findings from recent performance testing of Paimon in Flink on highly partitioned tables with large data volumes.
For PK tables with thousands of partitions in fixed bucket mode, where all partitions are receiving streaming writes, we have observed that:
- manifest files are fetched from the cloud filesystem excessively, hundreds of thousands of times or even more
- manifest fetching can be computationally intensive due to decompression & deserialization, leading to Flink TaskManager CPU utilization pinned at 100% until all manifest fetching is complete
- this causes very long run times for dedicated compaction jobs, unstable streaming write performance, as well as very high API request costs to cloud filesystems for all the manifest file retrievals

After a lot of logging and debugging, the bottleneck appears to be FileSystemWriteRestore.restoreFiles <https://github.com/apache/paimon/blob/release-1.3/paimon-core/src/main/java/org/apache/paimon/operation/FileSystemWriteRestore.java#L80>, which repeatedly fetches the manifest for each (partition, bucket) combination and filters it down to only the data files relevant to that slice of (partition, bucket).

We have been testing a patch on FileSystemWriteRestore that pre-fetches and caches the entire manifest, avoiding duplicated API requests and reducing the computational burden of repeated decompression/deserialization (a rough sketch of the idea is included in the P.S. below).

Draft PR for discussion: https://github.com/apache/paimon/pull/7031
GitHub issue with some further details: https://github.com/apache/paimon/issues/7030

I'd like to get some feedback from Paimon maintainers/devs on:
- whether this is an acceptable approach, and any suggestions for alternative implementation approaches
- whether there are any caveats/issues this might cause (e.g. any risk that may lead to data loss?)

Many thanks,
Mao
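
P.S. For concreteness, here is a minimal sketch of the caching idea, just to make the discussion easier. All class and method names below (ManifestScanner, FileEntry, PartitionBucket, and this restoreFiles signature) are hypothetical placeholders, not Paimon APIs and not the actual code in the PR; please see the PR for the real implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: read and decode the manifest once, index the
 * entries by (partition, bucket), and serve every subsequent
 * per-(partition, bucket) restore from the in-memory index instead of
 * re-fetching and re-deserializing manifest files each time.
 */
public class CachedManifestIndex {

    /** Placeholder for one manifest entry: a data file and where it lives. */
    public record FileEntry(String partition, int bucket, String fileName) {}

    /** Placeholder key for a (partition, bucket) slice. */
    public record PartitionBucket(String partition, int bucket) {}

    /** Placeholder for the expensive operation: one full manifest read. */
    public interface ManifestScanner {
        List<FileEntry> readAllEntries();
    }

    private final ManifestScanner scanner;

    // Built lazily, exactly once; subsequent lookups are pure map access.
    private volatile Map<PartitionBucket, List<FileEntry>> index;

    public CachedManifestIndex(ManifestScanner scanner) {
        this.scanner = scanner;
    }

    /** Replaces the per-(partition, bucket) manifest scan with a map lookup. */
    public List<FileEntry> restoreFiles(String partition, int bucket) {
        return index().getOrDefault(new PartitionBucket(partition, bucket), List.of());
    }

    private Map<PartitionBucket, List<FileEntry>> index() {
        Map<PartitionBucket, List<FileEntry>> local = index;
        if (local == null) {
            synchronized (this) {
                if (index == null) {
                    Map<PartitionBucket, List<FileEntry>> built = new HashMap<>();
                    // Single pass over the manifest: one round of API calls,
                    // one round of decompression/deserialization.
                    for (FileEntry entry : scanner.readAllEntries()) {
                        built.computeIfAbsent(
                                        new PartitionBucket(entry.partition(), entry.bucket()),
                                        k -> new ArrayList<>())
                                .add(entry);
                    }
                    index = built;
                }
                local = index;
            }
        }
        return local;
    }
}
```

The trade-off in this kind of approach is memory: the whole manifest for the snapshot is held in memory by the restoring task instead of being filtered per (partition, bucket), which is something I'd welcome feedback on as well.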
