Re: Write performance of highly partitioned tables

Jingsong Li Wed, 10 Jun 2026 02:16:46 -0700

Thanks Mao for your contribution!

Best,
Jingsong


On Wed, Jun 10, 2026 at 5:07 PM Mao Liu <[email protected]> wrote:
>
> Hi Jingsong,
>
> I'd like to provide an update on this old thread.
>
> I have had some time to further investigate the writer coordinator approach, 
> and the memory problems we initially observed on Paimon 1.3.1. I was able to 
> reproduce the memory spike using a benchmark test on a highly partitioned 
> table, and identified a few cache tuning configurations that can make it 
> viable in our situation.
>
> It was a pleasant surprise that one of the major causes of the memory spike 
> issue was already fixed on 1.4/master 
> (https://github.com/apache/paimon/pull/6355). I have raised some PRs to 
> resolve the remaining issues:
>
> https://github.com/apache/paimon/pull/8186
> - Adds config option to use strong references in the cache, avoiding a cache 
> thrash spiral where GC and cache loads are in contention
> - Adds config option to prefetch the entire manifest on the writer 
> coordinator, avoiding memory spikes in jobs with high parallelism due to many 
> tasks simultaneously requesting manifest entries from the job manager
> - Also includes benchmark tests for reading from the manifest cache. Even 
> though this no longer reveals problematic symptoms, the benchmark could still 
> be useful for testing against future regressions.
>
> https://github.com/apache/paimon/pull/8128
> - Integrates the writer coordinator for compaction jobs (already merged, 
> thank you for the review!)
>
> I’m happy to report the combination of these changes has worked well for our 
> highly partitioned table.
>
> Many thanks,
> Mao
>
>
> On Wed, 14 Jan 2026 at 20:33, Mao Liu <[email protected]> wrote:
>>
>> Hi Jingsong,
>>
>> We made a few attempts with the writer coordinator enabled.
>>
>> One note is that the writer coordinator is not yet implemented for 
>> StoreCompactOperator, so it is available for write jobs but not compaction 
>> jobs. In my draft PR I have included changes to enable writer coordinator 
>> for compaction.
>>
>> We also found that the memory usage of the writer coordinator was 
>> unexpectedly high. For a manifest directory of <1 GB, with the writer 
>> coordinator cache set to 10 GB, we still saw only ~30% cache hits and 
>> eventually a job master heap OOM. In addition, we continued to observe the 
>> manifest being fetched thousands of times, originating from the JM instead 
>> of the TMs. Given that there is only a single coordinator on the JM 
>> responding to many TMs, the observed performance was worse than without the 
>> writer coordinator (though this bottleneck may only appear when writing to 
>> or compacting thousands of partitions at once).
>>
>> I suspect the coordinated write restore class would benefit from 
>> pre-fetching the manifest as well, but I’m not able to explain the very high 
>> memory usage for the existing manifest cache implementation.
>>
>> Many thanks
>> Mao
>>
>> On Wed, 14 Jan 2026 at 13:27, Jingsong Li <[email protected]> wrote:
>>>
>>> Hi Mao,
>>>
>>> Why not use `sink.writer-coordinator.enabled`?
>>>
>>> Best,
>>> Jingsong
>>>
>>> On Tue, Jan 13, 2026 at 7:51 PM Mao Liu <[email protected]> wrote:
>>> >
>>> > Greetings Paimon community and devs,
>>> >
>>> > I’d like to share some findings from recent performance testing of Paimon 
>>> > in Flink, on highly partitioned tables with large data volumes.
>>> >
>>> > For PK tables with thousands of partitions and fixed bucket, where all 
>>> > partitions are receiving streaming writes, we have observed that:
>>> > - manifest files are fetched from the cloud filesystem excessively, 
>>> > hundreds of thousands of times or even more
>>> > - manifest fetching can be computationally intensive due to decompression 
>>> > & deserialization, leading to Flink TaskManager CPU utilization pinned at 
>>> > 100% until all manifest fetching has completed
>>> > - this causes very long run times for dedicated compaction jobs, unstable 
>>> > streaming job write performance, as well as very high API request costs 
>>> > to cloud filesystems for all the manifest file retrievals
>>> >
>>> > After a lot of logging and debugging, the bottleneck appears to be 
>>> > FileSystemWriteRestore.restoreFiles, which repeatedly fetches the 
>>> > manifest for each (partition, bucket) combination, filtering down to only 
>>> > the relevant data files for its own slice of (partition, bucket).
>>> >
>>> > We have been testing a patch on FileSystemWriteRestore, where we are 
>>> > pre-fetching and caching the entire manifest to avoid duplicated API 
>>> > requests and reduce computational burden of repeated 
>>> > decompression/deserialization.
>>> >
>>> > Draft PR for discussion: https://github.com/apache/paimon/pull/7031
>>> > Github issue with some further details: 
>>> > https://github.com/apache/paimon/issues/7030
>>> >
>>> > I'd like to get some feedback from Paimon maintainers/devs on:
>>> > - whether this is an acceptable approach, or suggestions for alternative 
>>> > implementation approaches
>>> > - whether there are any caveats/issues that this might cause (e.g. any 
>>> > risk that may lead to data loss)
>>> >
>>> > Many thanks,
>>> > Mao

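The original message in this thread contrasts two restore strategies: re-reading the manifest for every (partition, bucket) slice, versus pre-fetching it once and serving each slice from memory. The following is only a minimal sketch of that contrast, not Paimon code. Every name in it (ManifestEntry, CountingManifestStore, restore_per_slice, PrefetchedRestore) is hypothetical, and the in-memory store merely stands in for a manifest on a cloud filesystem.

```python
from collections import defaultdict
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    """Hypothetical manifest row: one data file in one (partition, bucket)."""
    partition: str
    bucket: int
    file_name: str


class CountingManifestStore:
    """Hypothetical stand-in for a remote manifest; counts full reads."""

    def __init__(self, entries):
        self._entries = list(entries)
        self.reads = 0

    def read_all(self):
        # Each call models one fetch + decompress + deserialize of the manifest.
        self.reads += 1
        return list(self._entries)


def restore_per_slice(store, partition, bucket):
    """Naive restore: re-read the whole manifest and keep only one slice."""
    return [e.file_name for e in store.read_all()
            if e.partition == partition and e.bucket == bucket]


class PrefetchedRestore:
    """Read the manifest once, then index entries by (partition, bucket)."""

    def __init__(self, store):
        self._index = defaultdict(list)
        for e in store.read_all():
            self._index[(e.partition, e.bucket)].append(e.file_name)

    def restore(self, partition, bucket):
        # .get avoids creating empty index entries for unknown slices.
        return list(self._index.get((partition, bucket), []))


# 100 partitions x 4 buckets, one data file per slice (made-up data).
slices = [(f"p{p}", b) for p in range(100) for b in range(4)]
entries = [ManifestEntry(part, b, f"data-{part}-{b}") for part, b in slices]

naive_store = CountingManifestStore(entries)
naive = {s: restore_per_slice(naive_store, *s) for s in slices}

prefetch_store = CountingManifestStore(entries)
prefetched = PrefetchedRestore(prefetch_store)
cached = {s: prefetched.restore(*s) for s in slices}

print(naive_store.reads, prefetch_store.reads)
```

In this toy model the number of manifest reads grows with the number of slices in the naive path and stays constant in the pre-fetched path, at the cost of holding the whole index in memory. That memory cost is the trade-off the later messages in the thread discuss.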