Hi Jingsong,

We ran a few tests with the writer coordinator enabled.

One note is that the writer coordinator is not yet implemented for StoreCompactOperator, so it is available for write jobs but not for compaction jobs. My draft PR includes changes to enable the writer coordinator for compaction as well.

We also found that the memory usage of the writer coordinator was unexpectedly high. For a manifest directory of under 1 GB, with the writer coordinator cache set to 10 GB, we still saw only ~30% cache hits and eventually a job master heap OOM. In addition, we continued to observe manifests being fetched thousands of times, now originating from the JM instead of the TMs. Given that there is only a single coordinator on the JM responding to many TMs, the observed performance was worse than without the writer coordinator (though this bottleneck may only apply when simultaneously writing to/compacting thousands of partitions at once).

I suspect the coordinated write restore class would benefit from pre-fetching the manifest as well, but I'm not able to explain the very high memory usage of the existing manifest cache implementation.
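To make the pre-fetching idea concrete, here is a minimal sketch of the pattern, assuming a flattened view of manifest entries. CachingManifestReader, ManifestSource and DataFileEntry are made-up names for illustration only and are not the actual classes changed in the PR:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Illustrative sketch only. ManifestSource and DataFileEntry are hypothetical stand-ins,
// not real Paimon types; the actual change lives around FileSystemWriteRestore.restoreFiles.
public class CachingManifestReader {

    /** Hypothetical reader that fetches, decompresses and deserializes all manifest entries. */
    public interface ManifestSource {
        List<DataFileEntry> readAllEntries(long snapshotId);
    }

    /** Hypothetical flattened manifest entry: which data file belongs to which (partition, bucket). */
    public record DataFileEntry(String partition, int bucket, String fileName) {}

    private final ManifestSource source;

    // One fetch/decompress/deserialize per snapshot, instead of once per (partition, bucket).
    private final Map<Long, List<DataFileEntry>> cache = new ConcurrentHashMap<>();

    public CachingManifestReader(ManifestSource source) {
        this.source = source;
    }

    /** Restores the data files for one (partition, bucket), reusing the cached full manifest. */
    public List<DataFileEntry> restoreFiles(long snapshotId, String partition, int bucket) {
        List<DataFileEntry> all = cache.computeIfAbsent(snapshotId, source::readAllEntries);
        return all.stream()
                  .filter(e -> e.partition().equals(partition) && e.bucket() == bucket)
                  .collect(Collectors.toList());
    }
}
```

The obvious trade-off is holding the fully deserialized manifest in memory for the duration of the restore, so cache size limits and eviction still matter.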
Many thanks,
Mao

On Wed, 14 Jan 2026 at 13:27, Jingsong Li <[email protected]> wrote:
> Hi Mao,
>
> Why not use `sink.writer-coordinator.enabled`?
>
> Best,
> Jingsong
>
> On Tue, Jan 13, 2026 at 7:51 PM Mao Liu <[email protected]> wrote:
> >
> > Greetings Paimon community and devs,
> >
> > I'd like to share some findings from recent performance testing of Paimon in Flink, on highly partitioned tables with large data volumes.
> >
> > For PK tables with thousands of partitions and fixed bucket, where all partitions are receiving streaming writes, we have observed that:
> > - manifest files are fetched from the cloud filesystem rather excessively, 100,000s of times or even more
> > - manifest fetching can be computationally intensive due to decompression & deserialization, leaving Flink TaskManager CPU utilization pinned at 100% until all manifest fetching has completed
> > - this causes very long run times for dedicated compaction jobs, unstable streaming job write performance, and very high API request costs to cloud filesystems for all the manifest file retrievals
> >
> > After a lot of logging and debugging, the bottleneck appears to be FileSystemWriteRestore.restoreFiles, which repeatedly fetches the manifest for each (partition, bucket) combination and filters it down to only the data files relevant to its own (partition, bucket) slice.
> >
> > We have been testing a patch on FileSystemWriteRestore that pre-fetches and caches the entire manifest, to avoid duplicated API requests and to reduce the computational burden of repeated decompression/deserialization.
> >
> > Draft PR for discussion: https://github.com/apache/paimon/pull/7031
> > GitHub issue with some further details: https://github.com/apache/paimon/issues/7030
> >
> > I'd like to get some feedback from Paimon maintainers/devs on
> > - whether this is an acceptable approach / suggestions for alternative implementation approaches
> > - whether there are any caveats/issues this might cause (e.g. any risk that may lead to data loss?)
> >
> > Many thanks,
> > Mao
