Thanks Mao for your contribution! Best, Jingsong
On Wed, Jun 10, 2026 at 5:07 PM Mao Liu <[email protected]> wrote: > > Hi Jingsong, > > I'd like to provide an update on this old thread. > > I have had some time to further investigate the writer coordinator approach, > and the memory problems we initially observed on Paimon 1.3.1. I was able to > reproduce the memory spike using a benchmark test on a highly partitioned > table, and identified a few cache tuning configurations that can make it > viable in our situation. > > It was a pleasant surprise that one of the major causes of the memory spike > issue was already fixed on 1.4/master > (https://github.com/apache/paimon/pull/6355). I have raised some PRs to > resolve the remaining issues: > > https://github.com/apache/paimon/pull/8186 > - Adds config option to use strong references in the cache, avoiding a cache > thrash spiral where GC and cache loads are in contention > - Adds config option to prefetch the entire manifest on the writer > coordinator, avoiding memory spikes in jobs with high parallelism due to many > tasks simultaneously requesting manifest entries from the job manager > - Also includes benchmark tests for reading from the manifest cache. Even > though this no longer reveals problematic symptoms, the benchmark could still > be useful for testing against future regressions. > > https://github.com/apache/paimon/pull/8128 > - integrates writer coordinator for compaction job - already merged, thank > you for the review! > > I’m happy to report the combination of these changes has worked well for our > highly partitioned table. > > Many thanks, > Mao > > > On Wed, 14 Jan 2026 at 20:33, Mao Liu <[email protected]> wrote: >> >> Hi Jingsong, >> >> We did a few attempts with the writer coordinator enabled. >> >> One note is that the writer coordinator is not yet implemented for >> StoreCompactOperator, so it is available for write jobs but not compaction >> jobs. In my draft PR I have included changes to enable writer coordinator >> for compaction. >> >> We also found that the memory usage of the writer coordinator was >> unexpectedly high. For a manifest directory of <1Gb, and writer coordinator >> cache set to 10Gb, we still saw only ~30% cache hits and also eventually job >> master heap OOM. In addition, we continued to observe manifest being fetched >> thousands of times, originating from the JM instead of the TMs. Given that >> there is only a single coordinator on the JM responding to many TMs, the >> observed performance was worse than without the write coordinator… (though >> this bottleneck may be limited only when simultaneously writing >> to/compacting thousands of partitions at once) >> >> I suspect the coordinated write restore class would benefit from >> pre-fetching the manifest as well, but I’m not able to explain the very high >> memory usage for the existing manifest cache implementation. >> >> Many thanks >> Mao >> >> On Wed, 14 Jan 2026 at 13:27, Jingsong Li <[email protected]> wrote: >>> >>> Hi Mao, >>> >>> Why not use `sink.writer-coordinator.enabled`? >>> >>> Best, >>> Jingsong >>> >>> On Tue, Jan 13, 2026 at 7:51 PM Mao Liu <[email protected]> wrote: >>> > >>> > Greetings Paimon community and devs, >>> > >>> > I’d like to share some findings from recent performance testing of Paimon >>> > in Flink, on highly partitioned tables with large data volumes. >>> > >>> > For PK tables with thousands of partitions and fixed bucket, where all >>> > partitions are receiving streaming writes, we have observed that: >>> > - manifest files are fetched from cloud filesystem rather excessively, >>> > 100,000s of times or even more >>> > - manifest fetching can be computationally intensive due to decompression >>> > & deserialization, leading to Flink TaskManager CPU utilization pinned at >>> > 100% until manifest fetching are all completed >>> > - this causes very long run times for dedicated compaction jobs, unstable >>> > streaming job write performance, as well as very high API request costs >>> > to cloud filesystems for all the manifest file retrievals >>> > >>> > After a lot of logging and debugging, the bottleneck appears to be >>> > FileSystemWriteRestore.restoreFiles, which repeatedly fetches the >>> > manifest for each (partition, bucket) combination, filtering down to only >>> > the relevant data files for its own slice of (partition, bucket). >>> > >>> > We have been testing a patch on FileSystemWriteRestore, where we are >>> > pre-fetching and caching the entire manifest to avoid duplicated API >>> > requests and reduce computational burden of repeated >>> > decompression/deserialization. >>> > >>> > Draft PR for discussion: https://github.com/apache/paimon/pull/7031 >>> > Github issue with some further details: >>> > https://github.com/apache/paimon/issues/7030 >>> > >>> > I'd like to get some feedback from Paimon maintainers/devs on >>> > - whether this is an acceptable approach / suggestions for alternative >>> > implementation approaches >>> > - are there any caveats/issues that this might cause (e.g. any risk that >>> > may lead to data loss?) >>> > >>> > Many thanks, >>> > Mao
