Re: Write performance of highly partitioned tables

Mao Liu Wed, 10 Jun 2026 02:07:52 -0700

Hi Jingsong,

I'd like to provide an update on this old thread.


I have had some time to further investigate the writer coordinator
approach, and the memory problems we initially observed on Paimon 1.3.1. I
was able to reproduce the memory spike using a benchmark test on a highly
partitioned table, and identified a few cache tuning configurations that
can make it viable in our situation.

It was a pleasant surprise that one of the major causes of the memory spike
issue was already fixed on 1.4/master (
https://github.com/apache/paimon/pull/6355). I have raised some PRs to
resolve the remaining issues:

https://github.com/apache/paimon/pull/8186
- Adds config option to use strong references in the cache, avoiding a
cache thrash spiral where GC and cache loads are in contention
- Adds config option to prefetch the entire manifest on the writer
coordinator, avoiding memory spikes in jobs with high parallelism due to
many tasks simultaneously requesting manifest entries from the job manager
- Also includes benchmark tests for reading from the manifest cache. Even
though this no longer reveals problematic symptoms, the benchmark could
still be useful for testing against future regressions.

https://github.com/apache/paimon/pull/8128
- integrates writer coordinator for compaction job - already merged, thank
you for the review!

I’m happy to report the combination of these changes has worked well for
our highly partitioned table.

Many thanks,
Mao


On Wed, 14 Jan 2026 at 20:33, Mao Liu <[email protected]> wrote:

> Hi Jingsong,
>
> We did a few attempts with the writer coordinator enabled.
>
> One note is that the writer coordinator is not yet implemented for
> StoreCompactOperator, so it is available for write jobs but not compaction
> jobs. In my draft PR I have included changes to enable writer coordinator
> for compaction.
>
> We also found that the memory usage of the writer coordinator was
> unexpectedly high. For a manifest directory of <1Gb, and writer coordinator
> cache set to 10Gb, we still saw only ~30% cache hits and also eventually
> job master heap OOM. In addition, we continued to observe manifest being
> fetched thousands of times, originating from the JM instead of the TMs.
> Given that there is only a single coordinator on the JM responding to many
> TMs, the observed performance was worse than without the write coordinator…
> (though this bottleneck may be limited only when simultaneously writing
> to/compacting thousands of partitions at once)
>
> I suspect the coordinated write restore class would benefit from
> pre-fetching the manifest as well, but I’m not able to explain the very
> high memory usage for the existing manifest cache implementation.
>
> Many thanks
> Mao
>
> On Wed, 14 Jan 2026 at 13:27, Jingsong Li <[email protected]> wrote:
>
>> Hi Mao,
>>
>> Why not use `sink.writer-coordinator.enabled`?
>>
>> Best,
>> Jingsong
>>
>> On Tue, Jan 13, 2026 at 7:51 PM Mao Liu <[email protected]> wrote:
>> >
>> > Greetings Paimon community and devs,
>> >
>> > I’d like to share some findings from recent performance testing of
>> Paimon in Flink, on highly partitioned tables with large data volumes.
>> >
>> > For PK tables with thousands of partitions and fixed bucket, where all
>> partitions are receiving streaming writes, we have observed that:
>> > - manifest files are fetched from cloud filesystem rather excessively,
>> 100,000s of times or even more
>> > - manifest fetching can be computationally intensive due to
>> decompression & deserialization, leading to Flink TaskManager CPU
>> utilization pinned at 100% until manifest fetching are all completed
>> > - this causes very long run times for dedicated compaction jobs,
>> unstable streaming job write performance, as well as very high API request
>> costs to cloud filesystems for all the manifest file retrievals
>> >
>> > After a lot of logging and debugging, the bottleneck appears to be
>> FileSystemWriteRestore.restoreFiles, which repeatedly fetches the manifest
>> for each (partition, bucket) combination, filtering down to only the
>> relevant data files for its own slice of (partition, bucket).
>> >
>> > We have been testing a patch on FileSystemWriteRestore, where we are
>> pre-fetching and caching the entire manifest to avoid duplicated API
>> requests and reduce computational burden of repeated
>> decompression/deserialization.
>> >
>> > Draft PR for discussion: https://github.com/apache/paimon/pull/7031
>> > Github issue with some further details:
>> https://github.com/apache/paimon/issues/7030
>> >
>> > I'd like to get some feedback from Paimon maintainers/devs on
>> > - whether this is an acceptable approach / suggestions for alternative
>> implementation approaches
>> > - are there any caveats/issues that this might cause (e.g. any risk
>> that may lead to data loss?)
>> >
>> > Many thanks,
>> > Mao
>>
>

Re: Write performance of highly partitioned tables

Reply via email to